I have some code that I am using to try to filter out rows that are missing values, as seen here:
from astropy.table import Table
import numpy as np
data = '/home/myname/data.fits'
data = Table.read(data, format="fits")
ID = np.array(data['id'])
ID.astype(str)
redshift = np.array(data['z'])
redshift.astype(float)
radius = np.array(data['r'])
radius.astype(float)
mag = np.array(data['M'])
mag.astype(float)
def stack(array1, array2, array3, array4):
    # stacks multiple arrays so that corresponding values sit next to each other
    stacked_array = [(array1[i], array2[i], array3[i], array4[i]) for i in range(0, array1.size)]
    stacked_array = np.array(stacked_array)
    return stacked_array
stacked = stack(ID, redshift, radius, mag)
filtered_array = np.array([])
for i in stacked:
    if not i.any == 'nan':
        np.insert(filtered_array, i[0], axis=0)
The last for loop is where I'm having difficulty. I want to insert a row from my stacked array into my filtered array only if it has all of the information (some rows are missing redshift, others are missing magnitude, etc.). How can I loop over my stacked array and keep only the rows that have all 4 values I want? I currently keep getting this error:
TypeError: _insert_dispatcher() missing 1 required positional argument: 'values'
So something like this?
a=[[1,2,3,4],[1,"nan",2,3]]
b=[i for i in a if not any(j=='nan' for j in i)]
which gives b = [[1, 2, 3, 4]].
You can switch:
for i in stacked:
    if not i.any == 'nan':
        np.insert(filtered_array, i[0], axis=0)
to:
def any_is_nan(col):
    return len(list(filter(lambda x: x=='nan',col))) > 0
filtered_array = list(filter(lambda x: not any_is_nan(x),stacked))
See the documentation for Python's built-in filter() for details.
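As an alternative to comparing against the string 'nan', the numeric columns can be tested with np.isnan before they are stacked. The following is only a minimal sketch: the small arrays are hypothetical stand-ins for the columns read from the FITS table, and it assumes missing values are stored as NaN in float columns.

```python
import numpy as np

# Hypothetical stand-ins for the 'id', 'z', 'r' and 'M' columns.
ids = np.array(["a1", "a2", "a3", "a4"])
redshift = np.array([0.5, np.nan, 1.2, 0.8])
radius = np.array([3.0, 2.0, np.nan, 1.5])
mag = np.array([-21.0, -20.5, -19.0, -22.1])

# True only for rows where every numeric column holds a real value.
complete = ~(np.isnan(redshift) | np.isnan(radius) | np.isnan(mag))

# Apply the same boolean mask to each column to keep the complete rows.
filtered_ids = ids[complete]
filtered_redshift = redshift[complete]
filtered_radius = radius[complete]
filtered_mag = mag[complete]
```

Keeping the columns separate also avoids converting the floats to strings, which is what happens when IDs and numbers are packed into one NumPy array.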
I am trying to figure out a way to write an if statement based on a couple criteria. I have a large CSV file that I have cleaned and already organized. There are a couple things I need to do:
I first need a way to check whether the machine is "on" for more than 3 rows. If that is true, I need to get the corresponding pressure for that cycle and find its average. For example, in the df above, in rows 14-19 the machine is on for more than 3 rows, so I need the average pressure across all rows in this period.
This data contains 40,000 rows, so I will then need this to cycle through the entire sheet to keep track of the total number of times the machine is on, and each corresponding average pressure.
import pandas as pd
import numpy as np
header_list = ['Time']
df = pd.read_csv('S8-1.csv' , skiprows=6 , names = header_list)
#splits the data into proper columns
df[['Date/Time','Pressure']] = df.Time.str.split(",,", expand=True)
#deletes orginal messy column
df.pop('Time')
#convert Pressure from object to numeric
df['Pressure'] = pd.to_numeric(df['Pressure'], errors = 'coerce')
#converts to a time
df['Date/Time'] = pd.to_datetime(df['Date/Time'], format = '%m/%d/%y %H:%M:%S.%f' , errors = 'coerce')
df['Moving Average'] = df['Pressure'].rolling(window=5).mean()
df['Rolling Average Center']= df['Pressure'].rolling(window=5, center=True).mean()
df['Machine On/Off'] = ['1' if x >= 115 else '0' for x in df['Rolling Average Center'] ]
arr = df['Machine On/Off']
def find_runs(x):
    x = np.asanyarray(x)
    if x.ndim != 1:
        raise ValueError('Only 1D array supported')
    n = x.shape[0]
    if n == 0:
        return np.array([]), np.array([]), np.array([])
    else:
        loc_run_start = np.empty(n, dtype=bool)
        loc_run_start[0] = True
        np.not_equal(x[:-1], x[1:], out=loc_run_start[1:])
        run_starts = np.nonzero(loc_run_start)[0]
        # find run values
        run_values = x[loc_run_start]
        # find run lengths
        run_lengths = np.diff(np.append(run_starts, n))
        return run_values, run_starts, run_lengths
run = find_runs(arr)
df.iloc[_start:run_length]['whatever column']
Suggested first step: make new column with ones and zeros... 1 for on, 0 for off.
df['newcolumnname'] = 0
df.loc[df['machine on/off'] == 'on', 'newcolumnname'] = 1
Grab that column as a numpy array:
arr = df['newcolumnname'].to_numpy()
Then use the following code (credit: https://gist.github.com/alimanfoo/c5977e87111abe8127453b21204c1065):
import numpy as np
def find_runs(x):
    """Find runs of consecutive items in an array."""
    # ensure array
    x = np.asanyarray(x)
    if x.ndim != 1:
        raise ValueError('only 1D array supported')
    n = x.shape[0]
    # handle empty array
    if n == 0:
        return np.array([]), np.array([]), np.array([])
    else:
        # find run starts
        loc_run_start = np.empty(n, dtype=bool)
        loc_run_start[0] = True
        np.not_equal(x[:-1], x[1:], out=loc_run_start[1:])
        run_starts = np.nonzero(loc_run_start)[0]
        # find run values
        run_values = x[loc_run_start]
        # find run lengths
        run_lengths = np.diff(np.append(run_starts, n))
        return run_values, run_starts, run_lengths
Get the "runs"; run_values should contain only zeros and ones.
run_values, run_starts, run_lengths = find_runs(arr)
print(run_values) #just to see what order they are in
print(run_starts.shape)#get an idea of the shapes
print(run_lengths.shape)
_ix = run_values == 1  # boolean mask selecting the "on" runs
To get the data from your DataFrame:
for _start, _run_length in zip(run_starts[_ix], run_lengths[_ix]):
    tmp_df = df.iloc[_start:_start+_run_length]
    # do what you want
    print(tmp_df)
    break  # remove this...I just don't know how big your data is
You're correct when you are thinking "this answer does not run". But it should give you enough to get going and solve your problem.
Incomplete question gets an incomplete response.
TODO: make your machine on/off to be integer...not sure if it is absolutely needed, but if it's a numerical numpy array then there are more options available to you.
df['Machine On/Off'] = [1 if x >= 115 else 0 for x in df['Rolling Average Center'] ]
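The run-finding idea above can also be written with pandas alone. This is only a sketch under stated assumptions: the small DataFrame is made-up data, the on/off column is assumed to already hold integers, and run_id, on_runs and long_runs are names introduced here for illustration.

```python
import pandas as pd

# Made-up readings; 1 means the machine is on, 0 means off.
df = pd.DataFrame({
    "Pressure": [100, 120, 121, 100, 118, 119, 120, 122, 101, 100],
    "Machine On/Off": [0, 1, 1, 0, 1, 1, 1, 1, 0, 0],
})

on = df["Machine On/Off"] == 1

# A new run id starts every time the on/off state changes.
run_id = (on != on.shift()).cumsum()

# For each "on" run, record how many rows it spans and its mean pressure.
on_runs = df[on].groupby(run_id[on])["Pressure"].agg(["size", "mean"])

# Keep only the runs where the machine stayed on for more than 3 rows.
long_runs = on_runs[on_runs["size"] > 3]
print(long_runs)
```

len(long_runs) then gives the number of qualifying cycles, and the "mean" column holds the average pressure of each one.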
testcolumn is filled with strings of numbers and np.nan values.
I am trying to find the mean of the numerical values only.
The code does not filter out the np.nan values so I don't get the correct values.
columnCount = 0
columnMean = 0.0
for x in testcolumn:
    if x != np.nan:
        print(x)
        columnMean = float(x) + columnMean
        columnCount = columnCount + 1
columnMean = columnMean/columnCount
Use numpy.nanmean with astype(float):
import numpy as np
arr = np.array(['1','2',np.nan])
np.nanmean(arr.astype(float))
Output:
1.5
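The reason the original loop does not filter anything is that NaN never compares equal to any value, including itself, so x != np.nan is True for every x. A small sketch of the explicit loop with a working check follows; the short list is a hypothetical stand-in for testcolumn.

```python
import math
import numpy as np

# NaN is not equal to anything, not even to itself.
assert np.nan != np.nan

# Hypothetical stand-in for the column: strings of numbers plus NaN.
testcolumn = ["1", "2", np.nan]

# float() converts the strings and leaves NaN as NaN.
values = [float(x) for x in testcolumn]

# math.isnan is the reliable way to detect NaN in a plain float.
valid = [v for v in values if not math.isnan(v)]
column_mean = sum(valid) / len(valid)
print(column_mean)
```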
I am new to programming and had a question. If I had two numpy arrays:
A = np.array([[1,0,3], [2,6,5], [3,4,1],[4,3,2],[5,7,9]], dtype=np.int64)
B = np.array([[3,4,5],[6,7,9],[1,0,3],[4,5,6]], dtype=np.int64)
I want to compare the last two columns of array A to the last two columns of array B, and then if they are equal, output the entire row to a new array. So, the output of these two arrays would be:
[[1,0,3],
 [1,0,3],
 [5,7,9],
 [6,7,9]]
Because even though the first element does not match for the last two rows, the last two elements do.
Here is my code so far, but it is not even close to working. Can anyone give me some tips?
column_two_A = A[:,1]
column_two_B = B[:,1]
column_three_A = A[:,2]
column_three_B = B[:,2]
column_four_A = A[:,3]
column_four_B = B[:,3]
times = A[:,0]
for elementA in column_three_A:
    for elementB in column_three_B:
        if elementA == elementB:
            continue
for elementC in column_two_A:
    for elementD in column_two_B:
        if elementC == elementD:
            continue
for elementE in column_four_A:
    for elementF in column_four_B:
        if elementE == elementF:
            continue
element.append(time)
print(element)
NumPy has many functions for this kind of task. Here is a solution that checks whether the last two values of each row of A appear in B. Add print() statements to see what chk, chk2 and x are.
import numpy as np
A = np.array([[1,0,3], [2,6,5], [3,4,1],[4,3,2],[5,7,9]], dtype=np.int64)
B = np.array([[3,4,5],[6,7,9],[1,0,3],[4,5,6]], dtype=np.int64)
c = []
for k in A:
    chk = np.equal(k[-2:], B[:, -2:])
    chk2 = np.all(chk, axis=1)
    x = (B[chk2, :])
    if x.size:
        c.append(x)
print(c)
I think I figured it out by staying up all night... thank you!
for i in range(len(A)):
    for j in range(len(B)):
        if A[i][1] == B[j][1]:
            if A[i][2] == B[j][2]:
                print(B[j])
                print(A[i])
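For larger arrays, the same pairwise comparison can be done without Python loops by broadcasting. This is a sketch rather than a drop-in replacement: match, a_idx, b_idx and pairs are names introduced here, and the intermediate mask has len(A) x len(B) entries, so it assumes both arrays fit comfortably in memory.

```python
import numpy as np

A = np.array([[1,0,3], [2,6,5], [3,4,1],[4,3,2],[5,7,9]], dtype=np.int64)
B = np.array([[3,4,5],[6,7,9],[1,0,3],[4,5,6]], dtype=np.int64)

# match[i, j] is True when the last two columns of A[i] equal those of B[j].
match = (A[:, None, 1:] == B[None, :, 1:]).all(axis=2)

# Row indices into A and B for every matching pair.
a_idx, b_idx = np.nonzero(match)

# Interleave each matching row of A with its partner row from B.
pairs = np.stack([A[a_idx], B[b_idx]], axis=1).reshape(-1, A.shape[1])
print(pairs)
```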
Question: How could I perform the following task more efficiently?
My problem is as follows. I have a (large) 3D data set of points in real physical space (x,y,z). It has been generated by a nested for loop that looks like this:
import numpy as np
# Generate given dat with its ordering
x_samples = 2
y_samples = 3
z_samples = 4
given_dat = np.zeros(((x_samples*y_samples*z_samples),3))
row_ind = 0
for z in range(z_samples):
    for y in range(y_samples):
        for x in range(x_samples):
            row = [x+.1,y+.2,z+.3]
            given_dat[row_ind,:] = row
            row_ind += 1
for row in given_dat:
    print(row)
For the sake of comparing it to another set of data, I want to reorder the given data into my desired order as follows (unorthodox, I know):
# Generate data with desired ordering
x_samples = 2
y_samples = 3
z_samples = 4
desired_dat = np.zeros(((x_samples*y_samples*z_samples),3))
row_ind = 0
for z in range(z_samples):
    for x in range(x_samples):
        for y in range(y_samples):
            row = [x+.1,y+.2,z+.3]
            desired_dat[row_ind,:] = row
            row_ind += 1
for row in desired_dat:
    print(row)
I have written a function that does what I want, but it is horribly slow and inefficient:
def bad_method(x_samp,y_samp,z_samp,data):
    zs = np.unique(data[:,2])
    xs = np.unique(data[:,0])
    rowlist = []
    for z in zs:
        for x in xs:
            for row in data:
                if row[0] == x and row[2] == z:
                    rowlist.append(row)
    new_data = np.vstack(rowlist)
    return new_data
# Shows that my function does what I want
fix = bad_method(x_samples,y_samples,z_samples,given_dat)
print('Unreversed data')
print(given_dat)
print('Reversed Data')
print(fix)
# If it didn't work this will throw an exception
assert(np.array_equal(desired_dat,fix))
How could I improve my function so it is faster? My data sets usually have roughly 2 million rows. It must be possible to do this with some clever slicing/indexing which I'm sure will be faster but I'm having a hard time figuring out how. Thanks for any help!
You could reshape your array, swap the axes as necessary and reshape back again:
# (No need to copy if you don't want to keep the given_dat ordering)
data = np.copy(given_dat).reshape(( z_samples, y_samples, x_samples, 3))
# swap the "y" and "x" axes
data = np.swapaxes(data, 1,2)
# back to 2-D array
data = data.reshape((x_samples*y_samples*z_samples,3))
assert(np.array_equal(desired_dat,data))
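If the rows do not arrive in a known grid order, a sort-based alternative is np.lexsort, which does not depend on the input ordering. The sketch below assumes the desired order is z slowest, then x, then y fastest, as in the question; the meshgrid construction is only a loop-free way to rebuild the example data, and order, reordered and expected are names introduced here.

```python
import numpy as np

x_samples, y_samples, z_samples = 2, 3, 4

# Rebuild the given ordering (x fastest, then y, then z) without Python loops.
z, y, x = np.meshgrid(np.arange(z_samples), np.arange(y_samples),
                      np.arange(x_samples), indexing="ij")
given_dat = np.column_stack([x.ravel() + .1, y.ravel() + .2, z.ravel() + .3])

# np.lexsort treats the LAST key as the primary one: sort by z, then x, then y.
order = np.lexsort((given_dat[:, 1], given_dat[:, 0], given_dat[:, 2]))
reordered = given_dat[order]

# Cross-check against the reshape/swapaxes approach.
expected = (given_dat.reshape(z_samples, y_samples, x_samples, 3)
            .swapaxes(1, 2)
            .reshape(-1, 3))
assert np.array_equal(reordered, expected)
```

Sorting costs O(n log n), so for data that is already on a regular grid the reshape and swapaxes route is likely the cheaper of the two.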
I am trying to get the row and column numbers of the cells that meet each of three conditions in a Pandas DataFrame.
I have a large DataFrame (1850 x 1850 in the example below) containing only 0, 1 and -1; when I try to get the rows and columns, it takes forever to produce the output.
The following is an example I have been trying to use:
import pandas as pd
import numpy as np
a = pd.DataFrame(np.random.randint(2, size=(1845,1850)))
b = pd.DataFrame(np.random.randint(2, size=(5,1850)))
b[b == 1] = -1
c = pd.concat([a,b], ignore_index=True)
column_positive = []
row_positive = []
column_negative = []
row_negative = []
column_zero = []
row_zero = []
for column in range(0, c.shape[0]):
    for row in range(0, c.shape[1]):
        if c.iloc[column, row] == 1:
            column_positive.append(column)
            row_positive.append(row)
        elif c.iloc[column, row] == -1:
            column_negative.append(column)
            row_negative.append(row)
        else:
            column_zero.append(column)
            row_zero.append(row)
I did some web searching and found that np.where() does something like this, but I have no idea how to do it.
Could anyone suggest a better alternative?
You are right, np.where would be one way to do it. Here's an implementation with it:
# Extract the values from c into an array for ease in further processing
c_arr = c.values
# Use np.where to get row and column indices corresponding to three comparisons
column_zero, row_zero = np.where(c_arr==0)
column_negative, row_negative = np.where(c_arr==-1)
column_positive, row_positive = np.where(c_arr==1)
If you don't mind having rows and columns as a Nx2 shaped array, you could do it in a bit more concise manner, like so -
neg_idx, zero_idx, pos_idx = [np.argwhere(c_arr == item) for item in [-1,0,1]]
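To make the difference between the two return shapes concrete, here is a tiny sketch on a hypothetical 2 x 2 array (small is a name introduced here, not part of the question's data).

```python
import numpy as np

# Tiny stand-in for c_arr.
small = np.array([[0, 1],
                  [-1, 1]])

# np.where with a single condition returns one index array per axis.
rows, cols = np.where(small == 1)

# np.argwhere returns the same positions as an N x 2 array of (row, col) pairs.
pos_idx = np.argwhere(small == 1)

print(rows, cols)
print(pos_idx)
```

The first array returned by np.where indexes axis 0 (the DataFrame rows) and the second indexes axis 1 (the columns).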