I've seen questions adjacent to this answered a number of times, but I'm really, really new to Python, and can't seem to get those answers to work for me...
I'm trying to access every row in an np array, where both columns have values greater than 1.
So, if x is my original array, and x has 500 rows and 2 columns, I want to know which rows, of those 500, contain 2 values > 1.
I've tried a bunch of solutions, but the following two seem the closest:
Test1 = x[(x[:,0:1] > 1) & (x[:,1:2] > 1)]
# Where the first condition should look for values greater than 1 in the first column, and the second condition should look for values greater than 1 in the second column.
Test2 = np.where(x[:,0:1] > 1 & x[:,1:2] > 1)
Any help would be greatly appreciated! Thanks so much!
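For reference, a minimal sketch of the usual fix (with random stand-in data, since the real x isn't shown): take 1-D column slices, wrap each comparison in parentheses because & binds tighter than >, and combine the masks before indexing.

import numpy as np

x = np.random.randn(500, 2) * 2                 # stand-in for the real data

mask = (x[:, 0] > 1) & (x[:, 1] > 1)            # 1-D mask: True where both columns > 1
rows = np.where(mask)[0]                        # indices of the qualifying rows
both_gt_one = x[mask]                           # the qualifying rows themselves

The x[:, 0:1] slices in Test1 keep a (500, 1) shape, so the combined mask ends up picking out individual elements rather than whole rows, and Test2 additionally fails because 1 & x[:, 1:2] is evaluated before the comparisons.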
Related
I have a numpy series of numbers:
arr = np.array([1147.8, 1067.2, 957.6, 826.4])
And a pandas DF with two columns, 'right' and 'left', that describe a range, where each range is contained in the next one in the DF:
    right     left
0  1090    1159.5
1  1080    1169.5
2  1057.5  1191.99
For each number in arr, I would like to get the index of the first range containing it. For the first number (1147.8), it's gonna be 0, since it's in the range (1090, 1159.5). For the second one, it's gonna be 2, since 1067.2 is in (1057.5, 1191.99) but not in (1080, 1169.5) (nor, of course, in any of the previous ranges).
I could iterate the DF for each number in arr, but I'm looking for a smarter way.
Thanks
Do a full cross-product between arr and df, then filter, then select the first matching range. That's fine for small amounts of data. Ideally, you would do it all at once for all 2000 arrs; even with around 2 million rows in the DataFrame after .merge(df_arr, how='cross'), the approach would still work.
df_arr = pd.DataFrame({"arr": arr,
                       "id_arr": range(len(arr))})

(df.reset_index()
   .merge(df_arr, how='cross')
   .query("right < arr < left")
   .groupby("id_arr")
   .first())
Produces:
        index   right     left     arr
id_arr
0           0  1090.0  1159.50  1147.8
1           2  1057.5  1191.99  1067.2
Where index is the index of the tightest range.
The id_arr is used for grouping in case you have duplicate values in arr and you expect duplicate values in the results. If that's not relevant, one could also group by arr directly.
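One detail the sample output hides: numbers in arr that fall inside no range simply drop out of the grouped result. Assuming the chained expression above is assigned to a (hypothetical) variable result, a sketch of how to bring them back as NaN rows:

result = result.reindex(range(len(arr)))    # NaN rows for numbers that match no range
range_idx = result["index"]                 # 0.0, 2.0, NaN, NaN for the sample arr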
I am learning pandas and numpy in Python. I was trying to apply conditional statements to my DataFrame and encountered a ValueError due to a shape mismatch. Please kindly help me understand why, thank you!
Here is a look at my simple dataset:
I was trying to filter the DataFrame so that the following conditions are met:
area > 8 and area < 10
Here is the result that I have received:
The results are fine if I print each condition out individually, and I can't understand why the two can't be combined into a single DataFrame.
The problem is here: brics[brics['area'] > 8] and brics[brics['area'] < 10].
The inner expression in both cases produces a 5-element boolean vector, and both vectors have the same shape. The first has 4 Trues and 1 False, the second has 3 Trues and 2 Falses. But when you do brics[xxx], that selects a subset of rows: brics[xxx] where xxx has 4 Trues produces a 4-row DataFrame, and brics[xxx] where xxx has 3 Trues produces a 3-row DataFrame. You can't combine those.
The KEY is that you want to combine these BEFORE you use them as indexes:
x = brics[ np.logical_and( brics['area'] > 8, brics['area'] < 10 ) ]
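Equivalently, the two masks can be combined with pandas' element-wise & operator; the parentheses around each comparison matter because & binds tighter than the comparisons:
x = brics[(brics['area'] > 8) & (brics['area'] < 10)]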
And by the way, you made this much harder for us than it should have been because you posted an image instead of code we could cut and paste.
I'm trying to subset a column of values that were extracted from a correlation matrix. I want to get values greater than 0.75 or less than -0.75. I tried the first line of code and it only gave me positive values greater than 0.75. The second line of code errored out without a result.
Corr_matrix1 = Corr_matrix1[(Corr_matrix1['Coefficient'] >= abs(0.75))]
Corr_matrix1 = Corr_matrix1[(Corr_matrix1['Coefficient'] >= 0.75) & (Corr_matrix1['Coefficient'] <= -0.75)]
Any help would be appreciated.
You can do this with the DataFrame.query method, one of my favorite features of pandas and it's pretty slept on. Here's an example:
df.corr().query(
    'Coefficient <= -0.75 '
    'or Coefficient >= 0.75'
)
It's kind of odd: you pass the condition as adjacent string literals, with no commas between them, and Python concatenates them into a single query string (note the trailing space in the first piece). If you want to plug in a variable, you can use an f-string.
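For instance, with a hypothetical cutoff variable (name made up here), the same query could be built from f-strings:

cutoff = 0.75
df.corr().query(
    f'Coefficient <= -{cutoff} '
    f'or Coefficient >= {cutoff}'
)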
Take a look at IntervalIndex:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.IntervalIndex.html
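If this suggestion is aimed at the ranges question above, a rough sketch of how it could look (assuming, as in that sample data, the lower bounds sit in the 'right' column and the upper bounds in 'left'); the containment test is vectorised per value, though it still loops over arr:

import numpy as np
import pandas as pd

arr = np.array([1147.8, 1067.2, 957.6, 826.4])
df = pd.DataFrame({"right": [1090, 1080, 1057.5],
                   "left": [1159.5, 1169.5, 1191.99]})

intervals = pd.IntervalIndex.from_arrays(df["right"], df["left"], closed="neither")

first_match = []
for v in arr:
    hits = np.flatnonzero(intervals.contains(v))    # every range containing v
    first_match.append(hits[0] if hits.size else None)
# first_match == [0, 2, None, None]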
I have a numpy array of shape 4000*6 (6 columns). And I have a 1*6 row of minimum values (made from another numpy array of shape 3000*6).
I want to find everything in the large array that is below those values, but comparing each value to its corresponding column.
I've tried the simple way, based on a one column solution I already had:
largearray=[float('nan') if x<min_values else x for x in largearray]
but sadly it didn't work :(.
I can do a for loop for each column and each value, but I was wondering if there is a faster, more elegant solution.
Thanks
EDIT: I'll try to rephrase: I have 6 values, and 6 columns.
I want to find the values in each column that are lower than the corresponding one of the 6 values.
By array I mean a 2D array, sorry if that wasn't clear.
Sorry, I'm still thinking in MATLAB a bit.
This is my loop solution. It's on a df, not numpy. Still, is there a faster way?
a = 0
for y in dfnames:
    df[y] = [float('nan') if x < minvalues[a] else x for x in df[y]]
    a = a + 1
df is the large array or DataFrame.
dfnames are the column names I'm interested in.
minvalues are the minimum values for each column. I'm assuming the order is the same - a bad assumption, but it works for now.
Will appreciate any help making it better.
I think you just need
result = largearray.copy()
result[result < min_values] = np.nan
That is, result is a copy of largearray, but any element less than the corresponding column of min_values is set to nan.
If you want to blank entire rows only when all entries in the row are less than the corresponding column of min_values, then you want:
result = largearray.copy()
result[np.all(result < min_values, axis=1)] = np.nan
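A tiny, self-contained illustration of that broadcasting, assuming largearray holds floats (NaN can't be stored in an integer array):

import numpy as np

largearray = np.array([[1.0, 5.0, 2.0],
                       [4.0, 1.0, 9.0]])
min_values = np.array([3.0, 4.0, 5.0])     # one minimum per column

result = largearray.copy()
result[result < min_values] = np.nan       # min_values broadcasts across the rows
# result is now [[nan, 5., nan],
#                [ 4., nan, 9.]]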
I don't use numpy much, so this may not be the most commonly used solution, but it works:
import numpy

largearray = numpy.array([[1, 2, 3], [3, 4, 5]])
minvalues = numpy.array([3, 4, 5])
largearray1 = [(float('nan') if not numpy.all(numpy.less(x, minvalues)) else x) for x in largearray]
The result should be: [[1, 2, 3], nan]
I have three DataFrames that are all the same shape ~(1,000, 10,000).
original has ~20-100 non-zero values per row - very sparse
input is a copy of original, with 10 random non-zero values per row changed to zero
output is populated completely with non-zero values
I am now attempting to compare original and output only in the positions where original and input are different (i.e. just in the 10 randomly chosen positions per row).
Firstly, I create a df of only these elements of original with everything else set to zero:
maskedOriginal = original.where(original != input, other=0)
This is created in seconds. I then attempt to do the same for output:
maskedOutput = output.where(original != input, other=0)
However, since this is now working with 3 DataFrames, it is far too slow - I haven't even got a result after a couple of minutes. Is there any more suitable way to do this?
Use numpy.where with the DataFrame constructor:
arr = original.values
maskedOriginal = pd.DataFrame(np.where(arr != input, arr, 0),
                              index=original.index,
                              columns=original.columns)
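Presumably the same trick covers the slow maskedOutput step as well; a sketch, assuming output shares original's index and columns and that np/pd are already imported:

mask = original.values != input.values
maskedOutput = pd.DataFrame(np.where(mask, output.values, 0),
                            index=output.index,
                            columns=output.columns)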