Taking mean along columns with masks in Python - python

I have a 2D array containing data from some measurements. I have to take the mean along each column, considering only the good data.
Hence I have another 2D array of the same shape which contains 1s and 0s indicating whether the data at (i, j) is good or bad. Some of the "bad" data can be nan as well.
import numpy as np

def mean_exc_mask(x, mas):  # x is the real data array
    # mas tells if the data at the location is good/bad
    sum_array = np.zeros(len(x[0]))
    avg_array = np.zeros(len(x[0]))
    items_array = np.zeros(len(x[0]))
    for i in range(0, len(x[0])):      # We take a specific column first
        for j in range(0, len(x)):     # And then parse across rows
            if mas[j][i] == 0:         # If the data is good
                sum_array[i] = sum_array[i] + x[j][i]
                items_array[i] = items_array[i] + 1
        if items_array[i] == 0:        # If none of the data is good for a particular column
            avg_array[i] = np.nan
        else:
            avg_array[i] = float(sum_array[i]) / items_array[i]
    return avg_array
I am getting all values as nan!
Any ideas of what's going wrong here, or some other way to do this?

The code seems to work for me, but you can do it a whole lot simpler by using the built-in aggregation in NumPy:
(x*(m==0)).sum(axis=0)/(m==0).sum(axis=0)
I tried it with:
x = np.array([[-0.32220561, -0.93043128,  0.37695923],
              [ 0.08824206, -0.86961453, -0.54558324],
              [-0.40942331, -0.60216952,  0.17834533]])
and
m = np.array([[1, 1, 0],
              [1, 0, 0],
              [1, 1, 1]])
If you post example data, it is often easier to give a qualified answer.
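Putting the expression and the sample data together, here is a minimal runnable sketch (assuming, as in the question, that 0 in the mask marks good data; the np.where step is an added precaution, not part of the original answer, so that literal nan entries in the bad slots don't propagate through the multiplication):

import numpy as np

x = np.array([[-0.32220561, -0.93043128,  0.37695923],
              [ 0.08824206, -0.86961453, -0.54558324],
              [-0.40942331, -0.60216952,  0.17834533]])
m = np.array([[1, 1, 0],
              [1, 0, 0],
              [1, 1, 1]])

good = (m == 0)                       # True where the measurement is usable
x_clean = np.where(good, x, 0.0)      # zero out bad/nan entries before summing
col_mean = x_clean.sum(axis=0) / good.sum(axis=0)
print(col_mean)  # nan (0/0) for a column with no good data, e.g. the first one here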

Related

How do you make a pandas dataframe be filled randomly with 1s and 0s?

I have a pandas dataframe in Python that I would like to fill with 1s and 0s based on random choices.
If possible, I'd like it to be done with numpy. I've tried
tw['tw'] = np.random.choice([0, 1])
but this ends up just giving me a dataframe filled with either a 1 or a 0. I know that this is possible using a for loop:
for i in range(len(tw)):
    tw['tw'][i] = np.random.choice([0, 1])
But this feels inefficient. How might I go about doing this?
If you assign a scalar, the value will be broadcast to all indices.
You need to assign an array of the size of the Series.
Use the size parameter of numpy.random.choice:
tw['tw'] = np.random.choice([0, 1], size=len(tw))
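A small illustrative sketch of the difference (the tiny tw frame here is just a hypothetical stand-in for the asker's data):

import numpy as np
import pandas as pd

tw = pd.DataFrame({'id': range(5)})  # hypothetical small frame for illustration

tw['scalar'] = np.random.choice([0, 1])                  # one draw, broadcast to every row
tw['per_row'] = np.random.choice([0, 1], size=len(tw))   # an independent draw per row
print(tw)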
You can use numpy's random integer generator:
tw['tw'] = np.random.randint(low=0, high=2, size=len(tw))
Note that high is exclusive, so you have to pass 2.

How to extract elements in specific column of the dataset?

I have been trying to build a neural network. To do so I have to divide the data into x and y (my dataset was converted to numpy).
The data for x is the 1st column, which I have extracted successfully, but when I try to extract the 2nd column I get both the x and y values for y.
Here is the code I used to divide the data:
data=np.genfromtxt("/home/crpsm/Pycharm/DataSet/headbrain.csv",delimiter=',')
x=data[:,:1]
y=data[:, :2]
Here's the output of x and y:
x:-
[[3738.]
[4261.]
[3777.]
[4177.]
[3585.]
[3785.]
[3559.]
[3613.]
[3982.]
[3443.]
y:-
[[3738. 1297.]
[4261. 1335.]
[3777. 1282.]
[4177. 1590.]
[3585. 1300.]
[3785. 1400.]
[3559. 1255.]
[3613. 1355.]
[3982. 1375.]
[3443. 1340.]
Please tell me how to fix this error. Thanks in advance!
You may want to review the numpy indexing documentation.
To get the second column in the same shape as x, use y=data[:, 1:2].
Note: you are creating 2d arrays with this indexing (shape of (len(data), 1)). If you want 1d arrays, just use integers, not slices, for the second term:
x = data[:, 0]
y = data[:, 1]
What @w-m said in their answer is correct. You are currently assigning to x all rows (the first :) and the columns from zero up to, but not including, column one (with :1), and to y all rows (again the first :) and the columns from zero up to, but not including, column two (with :2).
x = data[:, 0]
y = data[:, 1]
Is one way to do this properly, but a nicer and more succinct way would be to use tuple unpacking:
x, y = data.T
This transposes (.T) the data, i.e. the two dimensions are exchanged, after which the first dimension has length two. If your actual data has more columns than that, you can use:
x, y, *rest = data.T
In this case rest will be a list of the remaining columns. This syntax was introduced in Python 3.0.
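To make the shape difference concrete, here is a small sketch (the array below is just a stand-in for the headbrain.csv data):

import numpy as np

# Hypothetical 2-column array standing in for the headbrain.csv data
data = np.array([[3738., 1297.],
                 [4261., 1335.],
                 [3777., 1282.]])

print(data[:, :1].shape)   # (3, 1) -- a slice keeps a 2-D column
print(data[:, 0].shape)    # (3,)   -- an integer index gives a 1-D vector

x, y = data.T              # transpose, then unpack the two rows
print(x.shape, y.shape)    # (3,) (3,)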

Pandas - expanding inverse quantile function

I have a dataframe of values:
df = pd.DataFrame(np.random.uniform(0,1,(500,2)), columns = ['a', 'b'])
>>> print df
            a         b
1    0.277438  0.042671
..        ...       ...
499  0.570952  0.865869

[500 rows x 2 columns]
I want to transform this by replacing the values with their percentile, where the percentile is taken over the distribution of all values in prior rows. i.e., if you do df.T.unstack(), it would be a pure expanding sample. This might be more intuitive if you think of the index as a DatetimeIndex, and I'm asking to take the expanding percentile over the entire cross-sectional history.
So the goal is this guy:
      a   b
0    99  99
..   ..  ..
499  58  84
(Ideally I'd like to take the distribution of a value over the set of all values in all rows before and including that row, so not exactly an expanding percentile; but if we can't get that, that's fine.)
I have one really ugly way of doing this, where I transpose and unstack the dataframe, generate a percentile mask, and overlay that mask on the dataframe using a for loop to get the percentiles:
percentile_boundaries_over_time = pd.DataFrame({integer:
    pd.expanding_quantile(df.T.unstack(), integer/100.0)
    for integer in range(0, 101, 1)})

percentile_mask = pd.Series(index=df.unstack().unstack().unstack().index)

for integer in range(0, 100, 1):
    percentile_mask[(df.unstack().unstack().unstack() >= percentile_boundaries_over_time[integer]) &
                    (df.unstack().unstack().unstack() <= percentile_boundaries_over_time[integer+1])] = integer
I've been trying to get something faster to work, using scipy.stats.percentileofscore() and pd.expanding_apply(), but it's not giving the correct output and I'm driving myself insane trying to figure out why. This is what I've been playing with:
perc = pd.expanding_apply(df, lambda x: stats.percentileofscore(x, x[-1], kind='weak'))
Does anyone have any thoughts on why this gives incorrect output? Or a faster way to do this whole exercise? Any and all help much appreciated!
As several other commenters have pointed out, computing percentiles for each row likely involves sorting the data each time. This will probably be the case for any current pre-packaged solution, including pd.DataFrame.rank or scipy.stats.percentileofscore. Repeatedly sorting is wasteful and computationally intensive, so we want a solution that minimizes that.
Taking a step back, finding the inverse-quantile of a value relative to an existing data set is analogous to finding the position at which we would insert that value into the data set if it were sorted. The issue is that we also have an expanding set of data. Thankfully, some sorting algorithms are extremely fast at dealing with mostly sorted data (and inserting a small number of unsorted elements). Hence our strategy is to maintain our own array of sorted data and, at each row iteration, add the row's values to it and query their positions in the newly expanded sorted set. The latter operation is also fast given that the data is sorted.
I think insertion sort would be the fastest sort for this, but its performance will probably be slower in Python than any native NumPy sort. Merge sort seems to be the best of the available options in NumPy. An ideal solution would involve writing some Cython, but using our above strategy with NumPy gets us most of the way.
This is a hand-rolled solution:
def quantiles_by_row(df):
    """Reconstruct a DataFrame of expanding quantiles by row."""
    # Construct the skeleton of the DataFrame that we'll fill with quantile values
    quantile_df = pd.DataFrame(np.NaN, index=df.index, columns=df.columns)

    # Pre-allocate numpy array. We only want to keep the non-NaN values from our DataFrame
    num_valid = np.sum(~np.isnan(df.values))
    sorted_array = np.empty(num_valid)

    # We want to maintain that sorted_array[:length] has data and is sorted
    length = 0

    # Iterate over ndarray rows
    for i, row_array in enumerate(df.values):
        # Extract non-NaN numpy array from row
        row_is_nan = np.isnan(row_array)
        add_array = row_array[~row_is_nan]

        # Add new data to our sorted_array and sort.
        new_length = length + len(add_array)
        sorted_array[length:new_length] = add_array
        length = new_length
        sorted_array[:length].sort(kind="mergesort")

        # Query the relative positions, divide by length to get quantiles
        quantile_row = np.searchsorted(sorted_array[:length], add_array, side="left").astype(np.float) / length

        # Insert values into quantile_df
        quantile_df.iloc[i][~row_is_nan] = quantile_row

    return quantile_df
Based on the data that bhalperin provided (offline), this solution is up to 10x faster.
One final comment: np.searchsorted has options for 'left' and 'right' which determine whether you want the prospective insertion position to be the first or the last suitable position. This matters if you have a lot of duplicates in your data. A more accurate version of the above solution will take the average of 'left' and 'right':
# Query the relative positions, divide to get quantiles
left_rank_row = np.searchsorted(sorted_array[:length], add_array, side="left")
right_rank_row = np.searchsorted(sorted_array[:length], add_array, side="right")
quantile_row = (left_rank_row + right_rank_row).astype(np.float) / (length * 2)
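As a quick usage sketch (my own addition, not from the original answer), the function can be run directly on the random frame from the question:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.uniform(0, 1, (500, 2)), columns=['a', 'b'])
quantile_df = quantiles_by_row(df)
print(quantile_df.head())
print(quantile_df.tail())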
It's not quite clear, but do you want a cumulative sum divided by the total?
norm = 100.0/df.a.sum()
df['cum_a'] = df.a.cumsum()
df['cum_a'] = df.cum_a * norm
Ditto for b.
Here's an attempt to implement your 'percentile over the set of all values in all rows before and including that row' requirement. stats.percentileofscore seems to act up when given 2D data, so squeezing seems to help in getting correct results:
a_percentile = pd.Series(np.nan, index=df.index)
b_percentile = pd.Series(np.nan, index=df.index)

for current_index in df.index:
    preceding_rows = df.loc[:current_index, :]
    # Combine values from all columns into a single 1D array
    # * 2 should be * N if you have N columns
    combined = preceding_rows.values.reshape((1, len(preceding_rows) * 2)).squeeze()
    a_percentile[current_index] = stats.percentileofscore(
        combined,
        df.loc[current_index, 'a'],
        kind='weak'
    )
    b_percentile[current_index] = stats.percentileofscore(
        combined,
        df.loc[current_index, 'b'],
        kind='weak'
    )

How to declare and fill an array in NumPy?

I need to create an empty array in Python and fill it in a loop.
data1 = np.array([ra,dec,[]])
Here is what I have. The ra and dec portions are from another array I've imported. What I am having trouble with is filling the other columns.
For example, let's say to fill the 3rd column I do this:
for i in range(0, 56):
    data1[i, 3] = 32
The error I am getting is IndexError: invalid index for the second line in the code sample above.
Additionally, when I check the shape of the array I created, it comes out as (3,). The data I have already entered is intended to be two columns with 56 rows of data.
So where am I messing up here? Should I transpose the array?
You could do:
data1 = np.zeros((56, 4))
to get a 56-by-4 array. If you don't want to start the array with zeros, you could use np.ones, np.empty, or np.ones((56, 4)) * np.nan.
Then, in most cases it is best not to use a Python loop unless it is really needed, for performance reasons.
So as an example this would do your loop:
data1[:, 3] = 32
data1 = np.array([ra, dec, [32]*len(ra)])
gives a single-line solution to your problem; but for efficiency, allocating an empty array first and then copying in the relevant parts would be preferable, so you avoid the construction of the dummy list.
One thing that nobody has mentioned is that in Python, indexing starts at 0, not 1.
This means that if you want to look at the third column of the array, you actually should address [:,2], not [:,3].
Good luck!
Assuming ra and dec are vectors (1-d):
data1 = np.concatenate([ra[:, None], dec[:, None], np.zeros((len(ra), 1)) + 32], axis=1)
Or:
data1 = np.empty((len(ra), 3))
data1[:, 0] = ra
data1[:, 1] = dec
data1[:, 2] = 32
If you want to fill an array with the same number, you can just do:
x_2 = np.ones(1000) + 1
for example, for 1000 entries of the number 2.
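As a side note (not part of the original answer), np.full is a more direct way to get a constant-filled array:

import numpy as np

x_2 = np.full(1000, 2.0)   # 1000 entries, all equal to 2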

Using numpy.argmax() on multidimensional arrays

I have a 4 dimensional array, i.e., data.shape = (20,30,33,288). I am finding the index of the closest array to n using
index = abs(data - n).argmin(axis = 1), so
index.shape = (20,33,288) with the indices varying.
I would like to use data[index] = "values" with values.shape = (20,33,288), but data[index] either returns the error "index (8) out of range (0<=index<1) in dimension 0", or the operation takes a relatively long time to compute and returns a matrix with a shape that doesn't seem to make sense.
How do I return an array of the correct values? i.e.,
data[index] = "values" with values.shape = (20,33,288)
This seems like a simple problem, is there a simple answer?
I would eventually like to find index2 = abs(data - n2).argmin(axis = 1), so I can perform an operation, say sum data at index to data at index2 without looping through the variables. Is this possible?
I am using python2.7 and numpy version 1.5.1.
You should be able to access the values selected by index using numpy.indices():
x, z, t = numpy.indices(index.shape)
data[x, index, z, t]
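A small hedged sketch of the same idea on a down-sized array (the shapes here are chosen only for illustration):

import numpy as np

# Small stand-in for the (20, 30, 33, 288) array from the question
data = np.random.rand(4, 5, 6, 7)
n = 0.5

index = abs(data - n).argmin(axis=1)   # shape (4, 6, 7)
x, z, t = np.indices(index.shape)

closest = data[x, index, z, t]         # shape (4, 6, 7), values nearest to n along axis 1
assert np.allclose(abs(closest - n), abs(data - n).min(axis=1))

# Values can also be assigned back through the same index arrays:
data[x, index, z, t] = 0.0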
If I understood you correctly, this should work:
numpy.put(data, index, values)
I learned something new today, thanks.
