I have a NumPy array that is created as follows:
data = np.zeros(500, dtype='float32, (50000,2)float32')
This array is filled with values acquired from measurements, and is supposed to reflect that at each time point (room for 500 time points) we can acquire 50,000 x and y coordinates.
Later in my code I use a bisect-like search, for which I need to know how many x-coordinates (measurement points) are actually in my array. I originally did this with np.count_nonzero(data), which yielded the following problem:
Fake data:
1 1
2 2
3 0
4 4
5 0
6 6
7 7
8 8
9 9
10 10
The non-zero count returns 18 values here; the code then goes into the bisect-like search using data[time][1][0][0] as the minimum x-coordinate and data[time][1][np.count_nonzero(data)][0] as the maximum x-coordinate, which results in the search stopping at 9 instead of 10.
I could use a while loop to manually count the non-zero values in the x-coordinate column, but that would be silly; I assume there is some built-in NumPy functionality for this. My question is then which built-in functionality, or which modification of my np.count_nonzero(data) call, I need, since the documentation doesn't offer much information in that regard (link to numpy doc).
-- Simplified question --
Can I use NumPy functionality to count the non-zero values of a single column only? (i.e. between data[time][1][0][0] and data[time][1][max][0])
Maybe a better approach would be to filter the array using nonzero and iterate over the result:
nonZeroData = data[time][1][np.nonzero(data[time][1])]
To count the non-zero values in the second column only:
nonZeroYCount = np.count_nonzero(data[time][1][:, 1])
If I understand you correctly, to select elements from data[time][1][0][0] to data[time][1][max][0]:
data[time][1][:max+1,0]
EDIT:
To count all non-zero x-coordinates for every time point:
(data["f1"][:,:,0] != 0).sum(1)
Why not consider using data != 0 to get the bool matrix?
You can use:
stat = (data != 0).sum(axis=0) to count the non-zero entries in each column.
I am not sure what shape your data array has, but I hope you can see what I mean. :)
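As a concrete illustration of the boolean-matrix idea (assuming a plain 2-D array of (x, y) pairs rather than the structured dtype above):

import numpy as np

coords = np.array([[1, 1], [2, 2], [3, 0], [4, 4], [5, 0]], dtype='float32')

# Boolean matrix of non-zero entries, summed per column
counts = (coords != 0).sum(axis=0)   # -> array([5, 3]): 5 non-zero x, 3 non-zero y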
Related
I am trying to do the following with a NumPy array (or a normal list):
To push the data I am doing:
ar1 = []
#Read from Pandas dataframe column. i is row number of data - it's working fine.
ar1.append((df['rolenumber'][i]))
OUTPUT:
[34768, 34739, 34726, 34719, 34715]
This result can come out ascending, descending, or mixed; anything is possible.
Here I want to take the last 3 values and validate whether they are ascending, descending, or mixed.
Ascending: if the last 3 values increase steadily. Example: 34726, 34739, 34745
Descending: if the last 3 values decrease steadily. Example: 34726, 34719, 34715
Mixed: if the last 3 go big number, then small number, then big number. Example: 34726, 34719, 34725
Note: no need to sort, only validate.
This little snippet should get you going:
a = np.array([34768, 34739, 34726, 34719, 34715])
is_descending = np.all(np.diff(a[-3:]) < 0)
is_ascending = np.all(np.diff(a[-3:]) > 0)
is_mixed = ~(is_ascending | is_descending)
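If it helps, the same checks can be wrapped in a small helper; the function name classify_last3 is just an illustration, not something from the original post:

import numpy as np

def classify_last3(values):
    # Classify the trend of the last three values.
    last3 = np.asarray(values)[-3:]
    diffs = np.diff(last3)
    if np.all(diffs > 0):
        return 'ascending'
    if np.all(diffs < 0):
        return 'descending'
    return 'mixed'

print(classify_last3([34768, 34739, 34726, 34719, 34715]))  # descending
print(classify_last3([34768, 34739, 34726, 34739, 34745]))  # ascending
print(classify_last3([34768, 34739, 34726, 34719, 34725]))  # mixed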
I want to find all the combinations of a binary matrix (ones and zeros) of size 18 x 9 where each row sums to 5 and each column sums to 10.
Also, each block must have a 1 in each column.
The total number of combinations for that grid size is... well, far too many to iterate over:
2 ** (18 * 9) combinations = 5,846,006,549,323,611,672,814,739,330,865,132,078,623,730,171,904
Although there are only 9!/(5!4!) = 126 combinations for a single row that sums to 5, with 18 rows that's still a lot: 64,072,225,938,746,379,480,587,511,979,135,205,376
However, each block must have at least one 1 in each column, which limits the number of combinations.
I wonder if I can break it down into block combinations, so it's potentially 6 blocks of 9 columns... which is then only 18,014,398,509,481,984 (obviously this doesn't factor in the work needed to work out the blocks first).
I figure NumPy has the power to do this, but I can't work it out.
I have done a couple of examples in Excel by hand
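For what it's worth, the 126 valid row patterns mentioned above are easy to enumerate, which could be a starting point for a block-wise search; a minimal sketch:

import itertools
import numpy as np

# All 9-element binary rows with exactly five 1s: C(9,5) = 126 patterns
row_patterns = []
for ones in itertools.combinations(range(9), 5):
    row = np.zeros(9, dtype=int)
    row[list(ones)] = 1
    row_patterns.append(row)

print(len(row_patterns))  # 126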
Binary matrix with row and column sum constraint.
I have a "large" DataFrame table with index being country codes (alpha-3) and columns being years (1900 to 2000) imported via a pd.read_csv(...) [as I understand, these are actually string so I need to pass it as '1945' for example].
The values are 0,1,2,3.
I need to "spread" these values until the next non-0 for each row.
example : 0 0 1 0 0 3 0 0 2 1
becomes: 0 0 1 1 1 3 3 3 2 1
I understand that I should not use iteration (my current implementation is something like this; as you can see, using 2 loops is not optimal, and I guess I could get rid of one by using apply(row)):
def spread_values(df):
    for idx in df.index:
        previous_v = 0
        for t_year in range(min_year, max_year):
            current_v = df.loc[idx, str(t_year)]
            if current_v == 0 and previous_v != 0:
                df.loc[idx, str(t_year)] = previous_v
            else:
                previous_v = current_v
However, I am told I should use the apply() function, vectorisation, or a list comprehension, because this is not optimal.
The apply function, however, regardless of the axis, does not let me dynamically get the index/column (which I need to conditionally update the cell), and I think the core reason I can't make the vectorised or list-comprehension options work is that I do not have a finite set of column names but rather a wide range (all the examples I see use a handful of named columns...).
What would be the more optimal / more elegant solution here?
Or are DataFrames not suited to my data at all? What should I use instead?
You can use df.replace(to_replace=0, method='ffill'). This will fill all zeros in your dataframe (except for zeros occurring at the start of your dataframe) with the previous non-zero value per column.
If you want to do it row-wise, unfortunately the .replace() function does not accept an axis argument. But you can transpose your dataframe, replace the zeros, and transpose it back: df.T.replace(0, method='ffill').T
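Another option, if the values are numeric, is to do the row-wise fill without transposing by masking the zeros first; a sketch based on the example row from the question (the index label is made up):

import pandas as pd

# Toy frame mirroring the example row 0 0 1 0 0 3 0 0 2 1
df = pd.DataFrame([[0, 0, 1, 0, 0, 3, 0, 0, 2, 1]],
                  columns=[str(y) for y in range(1900, 1910)],
                  index=['FRA'])

# Turn zeros into NaN, forward-fill along each row, then restore leading zeros
spread = df.mask(df == 0).ffill(axis=1).fillna(0).astype(int)
print(spread.loc['FRA'].tolist())  # [0, 0, 1, 1, 1, 3, 3, 3, 2, 1]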
count 716865 716873 716884 716943
0 -0.16029615828413712 -0.07630309240006158 0.11220663712532133 -0.2726775504078691
1 -0.6687265363491811 -0.6135022705188075 -0.49097425130988914 -0.736020384028633
2 0.06735205699309535 0.07948417451634422 0.09240256047258057 0.0617964313591086
3 0.372935701728449 0.44324822316416074 0.5625073287879649 0.3199599294007491
4 0.39439310866886124 0.45960496068147993 0.5591549439131621 0.34928093849248304
5 -0.08007381002566456 -0.021313801077641505 0.11996141286735541 -0.15572679401876433
6 0.20853071107951396 0.26561990841073535 0.3661990387594055 0.15720649076873264
7 -0.0488049712326824 0.02909288268076153 0.18643283476719688 -0.1438092892727158
8 0.017648470149950992 0.10136455179350337 0.2722686729095633 -0.07928001803992157
9 0.4693208827819954 0.6601182040950377 1.0 0.2858790498612906
10 0.07597883305423633 0.0720868097090368 0.06089458880790768 0.08522329510499728
I want to manipulate this normalized dataframe to do something similar to the .corr method pandas has built in, but with my own modifications. I want to create my own correlation method and build a heatmap, which I know how to do.
My end result is an NxN dataframe with 0 or 1 values that meets the criteria below. For the table I show above it will be 4x4.
The following steps are the criteria for my correlation method:
Loop through each column as the reference and subtract all the other columns from it.
While looping, disregard rows where both the reference and the correlating column have absolute normalized values of less than 0.2.
For the remaining rows, if the difference values are less than 10 percent, the correlation is good and I mark it with a 1 (positive correlation), or with a 0 if any of the differences of the count values is greater than 10%.
All the diagonal cells will have a 1 (good correlation with themselves) and the other cells will have either 0 or 1.
The following is what I have, but when I drop the deadband values it does not catch everything for some reason.
subdf = []
deadband = 0.2
for i in range(len(df2_norm.columns)):
    # First, let's drop non-zero above deadband values in each row
    df2_norm_drop = df2_norm.drop(df2_norm[(df2_norm.abs().iloc[:, i] < deadband) & \
                                           (df2_norm.abs().iloc[:, i] > 0)].index)
    # Take difference of first detail element normalized value to chart allowable
    # normalized value
    subdf.append(pd.DataFrame(df2_norm.subtract(df2_norm.iloc[:, i], axis=0)))
I know it looks like a lot, but I would really appreciate any help. Thank you!
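For reference, here is one possible reading of those criteria as code. The function name custom_corr, the tol value of 0.1 (taking "10 percent" as 10% of the normalized 0-1 range) and the row-dropping rule are my own assumptions, so adjust them to your actual definition:

import numpy as np
import pandas as pd

def custom_corr(df, deadband=0.2, tol=0.10):
    # Build an NxN 0/1 matrix; the diagonal is 1 by construction.
    cols = df.columns
    out = pd.DataFrame(np.eye(len(cols), dtype=int), index=cols, columns=cols)
    for a in cols:
        for b in cols:
            if a == b:
                continue
            # Ignore rows where both columns are inside the deadband
            keep = ~((df[a].abs() < deadband) & (df[b].abs() < deadband))
            diff = (df.loc[keep, a] - df.loc[keep, b]).abs()
            # 1 if every remaining difference is within the tolerance, else 0
            out.loc[a, b] = int((diff < tol).all())
    return out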
So, I'm a Python newbie looking for someone with an idea of how to optimize my code. I'm working with a spreadsheet with over 6,000 rows, and this portion of my code seems really inefficient.
for x in range(0, len(df)):
    if df.at[x, 'Streak_currency'] != str(df.at[x, 'Currency']):
        df.at[x, 'Martingale'] = df.at[x-1, 'Martingale'] + (df.at[x-1, 'Martingale'])/64
        x += 1
    if df.at[x, 'Streak_currency'] == str(df.at[x, 'Currency']):
        x += 1
It can take upwards of 8 minutes to run.
With my limited knowledge, I only managed to change my df.loc to df.at, and that helped a lot. But it is still too slow.
UPDATE
In this section of the code, I'm trying to apply a function based on the previous value until a certain condition is met, in this case:
df.at[x,'Streak_currency'] != str(df.at[x,'Currency'])
I really don't know why this iteration is taking so long. In theory, it should only look at the previous value and apply the function. Here is a sample of the output:
Periodo Currency ... Agrupamento Martingale
0 1 GBPUSD 1 1.583720 <--- starts applying a function over and over.
1 1 GBPUSD 1 1.608466
2 1 GBPUSD 1 1.633598
3 1 GBPUSD 1 1.659123
4 1 GBPUSD 1 1.685047
5 1 GBPUSD 1 1.711376 <- stops applying, since Currency changed
6 1 EURCHF 2 1.256550
7 1 USDCAD 3 1.008720 <- starts applying again until currency changes
8 1 USDCAD 3 1.024481
9 1 USDCAD 3 1.040489
10 1 GBPAUD 4 1.603080
Pandas lookups like df.at[x,'Streak_currency'] are not efficient. For each evaluation of this kind of expression (multiple times per loop iteration), pandas fetches the column by its name and then fetches the value from it.
You can avoid this cost by storing the columns in variables before the loop. Additionally, you can put each column in a NumPy array so the values can be fetched more efficiently (assuming all the values have the same type).
Finally, using string conversions and string comparisons on integers is not efficient. They can be avoided here (assuming the integers are not unreasonably big).
Here is an example:
import numpy as np

streakCurrency = np.array(df['Streak_currency'], dtype=np.int64)
currency = np.array(df['Currency'], dtype=np.int64)
martingale = np.array(df['Martingale'], dtype=np.float64)

for x in range(len(df)):
    if streakCurrency[x] != currency[x]:
        martingale[x] = martingale[x-1] * (65./64.)
        x += 1
    if streakCurrency[x] == currency[x]:
        x += 1

# Update the pandas dataframe
df['Martingale'] = martingale
This should be at least an order of magnitude faster.
Please note that the second condition is useless, since the compared values cannot be equal and different at the same time (this may be a bug in your code)...
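If the update really is just "multiply the previous Martingale by 65/64 whenever Streak_currency differs from Currency" (which matches the cascading values in the sample output), the loop can even be removed entirely. This is only a sketch: it assumes the very first row never needs the adjustment and reuses the streakCurrency, currency and martingale arrays from the snippet above:

import numpy as np
import pandas as pd

mask = pd.Series(streakCurrency != currency)

# Base value: the last untouched Martingale, carried forward
base = pd.Series(martingale).where(~mask).ffill()

# Number of consecutive adjusted rows since that base row
steps = mask.astype(int).groupby((~mask).cumsum()).cumsum()

df['Martingale'] = np.where(mask, base * (65./64.) ** steps, martingale)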