Doing column math with numpy in python - python

I am looking for coding examples to learn Numpy.
Usage would be dtype ='object'.
To construnct array the code used would
a= np.asarray(d, dtype ='object')
not np.asarray(d) or np.asarray(d, dtype='float32')
Is sorting any different than float32/64?
Coming from excel "cell" equations, wrapping my head around Row Column math.
Ex:
A = array([['a',2,3,4],['b',5,6,2],['c',5,1,5]], dtype ='object')
[['a',2,3,4],
['b',5,6,2],
['c',5,1,5]])
Create new array with:
How would I sort high to low by [3].
How calc for entire col. (1,1)- (1,0), Example without sorting A
['b',3],
['c',0]
How calc for enitre array (1,1) - (2,0) Example without sorting A
['b',2],
['c',-1]

Despite the fact that I still cannot understand exactly what you are asking, here is my best guess. Let's say you want to sort A by the values in 3rd column:
A = array([['a',2,3,4],['b',5,6,2],['c',5,1,5]], dtype ='object')
ii = np.argsort(A[:,2])
print A[ii,:]
Here the rows have been sorted according to the 3rd column, but each row is left unsorted.
Subtracting all of the columns is a problem due to the string objects, however if you exclude them, you can for example subtract the 3rd row from the 1st by:
A[0,1:] - A[2,1:]
If I didn't understand the basic point of your question, then please revise it. I highly recommend you take a look at the numpy tutorial and documentation if you have not done so already:
http://docs.scipy.org/doc/numpy/reference/
http://docs.scipy.org/doc/numpy/user/

Related

How do I append new data to the DataArray along existing dimension in-place?

I have data grouped into a 3-D DataArray named 'da' with dimensions 'time', 'indicators', and 'coins', using Dask as a backend:
I need to select the data for peculiar indicator, calculate a new indicator based on it, and append this newly calculated indicator to da along indicators dimension using the new indicator name (let's call it daily_return). In somewhat simplistic terms of a 2-D analogy, I need to perform something like calculating a new pandas DataFrame column based on its other columns, but in 3-D.
So far I've tried to apply_ufunc() with both drop=False (then I retrieve scalar indicators coordinate on the resulting DataArray) and drop=True (respectively, indicators are dropped) using the corresponding tutorial:
dr_func = lambda today, yesterday: today / yesterday - 1 # Assuming for simplicity that yesterday != 0
today_da = da.sel(indicators="price_daily_close_usd", drop=False) # or drop=True
yesterday_da = da.shift(time=1).sel(indicators="price_daily_close_usd", drop=False) # or drop=True
dr = xr.apply_ufunc(
dr_func,
today_da,
yesterday_da,
dask="allowed",
dask_gufunc_kwargs={"allow_rechunk": True},
vectorize=True
)
Obviously, in case of drop=True I cannot concat da and dr DataArrays, since indicators are not present among dr's coordinates.
In its turn, in case of drop=False I've managed to concat these DataArrays along indicators; however, the resulting indicators coord would contain two similarly named CoordinateVariables, specifically "price_daily_close_usd":
...while the second of them should be renamed into "daily_return".
I've also tried to extract the needed data from dr through .sel(), but failed due to the absence of index along indicators dimension (as far as I've understood, it's not possible to set an index in this case, since this dimension is scalar):
dr.sel(indicators="price_daily_close_usd") # Would result in KeyError: "no index found for coordinate 'indicators'"
Moreover, the solution above is not done in-place - i.e. it creates a new combined DataArray instance instead of modifying da, while the latter would be highly preferable.
How can I append new data to da along existing dimension, desirably in-place?
Loading all the data directly into RAM would hardly be possible due to its huge volumes, that's why Dask is being used.
I'm also not sticking to the DataArray data structure and it would be no problem to switch to a Dataset if it has more suitable methods for solving my problem.
Xarray does not support appending in-place. Any change to the shape of your array will need to produce a new array.
If you want to work with a single array and know the size of the final array, you could generate an empty array and assign values based on coordinate labels.
I need to perform something like calculating a new pandas DataFrame column based on its other columns, but in 3-D.
Xarray's Dataset is a better analog to the Pandas.Dataframe. The Dataset is a dict-like container storing ND arrays (DataArray's) just like the Dataframe is a dict-like container storing 1D arrays (Series).

Repeat calculations for every row of dataframe

The following calculations were for the 1st row, i.e., train_df.y1[0].
I want to repeat this operation for all 400 rows of train_df
squared_deviations_y1_0_train = ((ideal_df.loc[:0,"y1":"y50"] - train_df.y1[0]) ** 2).sum(axis=1)
The result is correct, just need to repeat it.
Since your end result seems to be a scalar, you can convert both of these dataframes to Numpy and take advantage of braodcasting.
Something like this,
squared_deviations = ((ideal_df.to_numpy() - train_df.y1.to_numpy().reshape(-1,1)) ** 2).sum(axis=1)
would do pretty nicely. If you MUST stay within pandas, you could use the subtract() method to get the same outcome.
(train_df.y1.subtract(ideal_df.T) ** 2).sum(axis=0)
Not that train_df.y1 becomes a row vector of size (400,) so you need to make the row dimension 400 to do this subtraction (hence the transpose of ideal_df).
You can also use the apply() method as Barmar suggested. This will require you to define a function that calculates the row index so that you can subtract the appropriate value of train_df for every cell before you perform the square and sum operations. Something like this,
(ideal_df.apply(lambda cell: cell - train_df.y1[cell.index]) ** 2).sum(axis=1)
would also work. I highly recommend using Numpy for these tasks because Numpy was designed with broadcasting in mind, but as shown you can get away with doing it in Pandas.

Compare each column to a different value efficiently

I have a numpy array of 4000*6 (6 column). And I have a numpy column (1*6) of minimum values (made from another numpy array of 3000*6).
I want to find everything in the large array that is below those values. but each value to it's corresponding column.
I've tried the simple way, based on a one column solution I already had:
largearray=[float('nan') if x<min_values else x for x in largearray]
but sadly it didn't work :(.
I can do a for loop for each column and each value, but i was wondering if there is a faster more elegant solution.
Thanks
EDIT: I'll try to rephrase: I have 6 values, and 6 columns.
i want to find the values in each column that are lower then the corresponding one from the 6 values.
by array I mean a 2d array. sorry if it wasn't clear
sorry, i'm still thinking in Matlab a bit.
this my loop solution. It's on df, not numpy. still, is there a faster way?
a=0
for y in dfnames:
df[y]=[float('nan') if x<minvalues[a] else x for x in df[y]]
a=a+1
df is the large array or dataframe
dfnames are the column names i'm interested in.
minvalues are the minimum values for each column. I'm assuming that the order is the same. bad assumption, but works for now.
will appreciate any help making it better
I think you just need
result = largearray.copy()
result[result < min_values] = np.nan
That is, result is a copy of largearray but ay element less than the corresponding column of min_values is set to nan.
If you want to blank entire rows only when all entries in the row are less than the corresponding column of min_values, then you want:
result = largearray.copy()
result[np.all(result < min_values, axis=1)] = np.nan
I don't use numpy, so it may be not commont used solution, but such work:
largearray = numpy.array([[1,2,3], [3,4,5]])
minvalues =numpy.array([3,4,5])
largearray1=[(float('nan') if not numpy.all(numpy.less(x, min_values)) else x) for x in largearray]
result should be: [[1,2,3], 'nan']

Assigning data to a multidimensional NumPy array

I'm trying to assign data to a Numpy array from a Pandas dataframe. The Numpy array is actually a netCDF variable with a slightly odd structure that I can't modify, to maintain compatibility.
The first dimension is the data (unlimited), while the second dimension is a reference to a given sensor (each sensorProcValue variable might have up to 12 sensors). For each sensor, I need to assign a Pandas Series (from a DataFrame).
My attempt:
sensorProcValue[:, current_sensor_index] = cnv.data[cnv_short_name][:]
This results in an error, which doesn't surprise me. The following works, but is obviously extremely inefficient:
for i in range(0, len(cnv.data[cnv_short_name])):
sensorProcValue[i, current_sensor_index] = cnv.data[cnv_short_name][i]
The shape of the data is as follows:
print(np.shape(sensorProcValue))
>>> (0, 12)
print(np.shape(cnv.data[cnv_short_name]))
>>> (40828,)
I'm sure I'm missing something obvious here (I'm still getting my head around Pandas/NumPy data structures), so any pointers would be much appreciated.

Normalising rows in numpy matrix

I am trying to normalize rows of a numpy matrix using L2 norm (unity length).
I am seeing a problem when I do that.
Assuming my matrix 'b' is as follows:
Now when I do the normalization of first row as below it works fine.
But when I try to do it by iterating through all the rows and converting the same matrix b as below it gives me all zeros.
Any idea why is that happening and how to get the correct normalization?.
Any faster way of row normalizing of matrix without having to iterate each row?. I don't want to use sci-kit learn normalization function though.
Thanks
The problem comes from the fact that b has type int so when you fill in row by row, numpy automatically converts the results of you computation (float) to int, hence the zeros. One way to avoid that is to define b with type float by using 0., 1. etc... or just adding .astype(float) at definition.
This should work to do the computation in one go which also doesn't require converting to float first:
b = b / np.linalg.norm(b, axis=1, keepdims=True)
This works because you are redefining the whole array rather than changing its rows one by one, and numpy is clever enough to make it float.

Categories