It seems that I have a wrong index somewhere, but I cannot spot it - Python

I am new to Python and having trouble with a function. It should delete rows of an (N,10) matrix (imported from a file) where -1 appears. This is the code:
import pandas as pd
import numpy as np

def load(name, f):
    file = pd.read_csv(name, header=None)
    totalMatrix = np.array(file)
    if f == 'forward':
        for i in range(len(totalMatrix)):
            for j in range(10):
                if totalMatrix[i,j] == -1:
                    if i > 0:
                        totalMatrix[i,j] = totalMatrix[i-1,j]
                    else:
                        print("Warning")
                        f = 'drop'
    elif f == 'drop':
        for i in range(len(totalMatrix)): # or np.size(totalMatrix[:, 0])
            for j in range(10):
                if totalMatrix[i,j] == -1:
                    totalMatrix = np.delete(totalMatrix, (i), axis=0)
    t = totalMatrix[:, 0:6]
    d = totalMatrix[:, 6:11]
    return t, d
But I keep running into this error:
line 38, in load
if totalMatrix[i,j] == -1 :
IndexError: index 2 is out of bounds for axis 0 with size 2
I have looked in several places on the internet but could not find an answer, nor could I find the error myself. Can anybody see what is wrong and tell me?

It doesn't work because the matrix is getting smaller while you keep iterating based on the old size, i.e. if totalMatrix has 3 rows in the beginning and you delete one, the last iteration will try to access a nonexistent row.
While iterating over the matrix, gather the indices of the rows you want to delete. Afterwards, you can delete them all at once:
toDelete = []
for i in range(len(totalMatrix)): # or np.size(totalMatrix[:, 0])
    for j in range(10):
        if totalMatrix[i, j] == -1:
            toDelete.append(i)
            break  # one -1 is enough to mark this row for deletion
totalMatrix = np.delete(totalMatrix, toDelete, axis=0)
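A more vectorized sketch, assuming the goal is simply to drop every row that contains a -1 anywhere in its 10 columns, would build a boolean row mask instead of calling np.delete:
# keep only rows with no -1 (assumes totalMatrix is the (N, 10) array from above)
mask = ~(totalMatrix == -1).any(axis=1)
totalMatrix = totalMatrix[mask]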

Pandas: If statements with multiple criteria

I am trying to figure out a way to write an if statement based on a couple of criteria. I have a large CSV file that I have already cleaned and organized. There are a couple of things I need to do:
I first need a way to check whether the machine is "on" for more than 3 rows. If that is true, then I need to get its corresponding pressure for that cycle and find the average of it. For example, in the df above, in rows 14-19 the machine is on for more than 3 rows, so I need to get the average pressure across all instances in this period.
This data contains 40,000 rows, so I will then need this to cycle through the entire sheet to keep track of the total number of times the machine is on, and each corresponding average pressure.
import pandas as pd
import numpy as np
header_list = ['Time']
df = pd.read_csv('S8-1.csv' , skiprows=6 , names = header_list)
#splits the data into proper columns
df[['Date/Time','Pressure']] = df.Time.str.split(",,", expand=True)
#deletes original messy column
df.pop('Time')
#convert Pressure from object to numeric
df['Pressure'] = pd.to_numeric(df['Pressure'], errors = 'coerce')
#converts to a time
df['Date/Time'] = pd.to_datetime(df['Date/Time'], format = '%m/%d/%y %H:%M:%S.%f' , errors = 'coerce')
df['Moving Average'] = df['Pressure'].rolling(window=5).mean()
df['Rolling Average Center']= df['Pressure'].rolling(window=5, center=True).mean()
df['Machine On/Off'] = ['1' if x >= 115 else '0' for x in df['Rolling Average Center'] ]
arr = df['Machine On/Off']
def find_runs(x):
    x = np.asanyarray(x)
    if x.ndim != 1:
        raise ValueError('Only 1D array supported')
    n = x.shape[0]
    if n == 0:
        return np.array([]), np.array([]), np.array([])
    else:
        loc_run_start = np.empty(n, dtype=bool)
        loc_run_start[0] = True
        np.not_equal(x[:-1], x[1:], out=loc_run_start[1:])
        run_starts = np.nonzero(loc_run_start)[0]
        # find run values
        run_values = x[loc_run_start]
        # find run lengths
        run_lengths = np.diff(np.append(run_starts, n))
        return run_values, run_starts, run_lengths

run = find_runs(arr)
df.iloc[_start:run_length]['whatever column']
Suggested first step: make a new column with ones and zeros... 1 for on, 0 for off.
df['newcolumnname'] = 0
df['newcolumnname'][df['machine on/off'] == 'on'] = 1
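As an aside, chained indexing like the line above can trigger pandas' SettingWithCopyWarning; an equivalent sketch using .loc (same hypothetical column names as above) would be:
df['newcolumnname'] = 0
df.loc[df['machine on/off'] == 'on', 'newcolumnname'] = 1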
Grab that column as a numpy array:
arr = df['newcolumnname'].to_numpy()
Then use the following code (credit: https://gist.github.com/alimanfoo/c5977e87111abe8127453b21204c1065):
import numpy as np

def find_runs(x):
    """Find runs of consecutive items in an array."""
    # ensure array
    x = np.asanyarray(x)
    if x.ndim != 1:
        raise ValueError('only 1D array supported')
    n = x.shape[0]
    # handle empty array
    if n == 0:
        return np.array([]), np.array([]), np.array([])
    else:
        # find run starts
        loc_run_start = np.empty(n, dtype=bool)
        loc_run_start[0] = True
        np.not_equal(x[:-1], x[1:], out=loc_run_start[1:])
        run_starts = np.nonzero(loc_run_start)[0]
        # find run values
        run_values = x[loc_run_start]
        # find run lengths
        run_lengths = np.diff(np.append(run_starts, n))
        return run_values, run_starts, run_lengths
get the "runs"...should only have zeros and ones.
run_values, run_starts, run_lengths = find_runs(arr)
print(run_values) #just to see what order they are in
print(run_starts.shape)#get an idea of the shapes
print(run_lengths.shape)
# boolean selector for the runs where the machine is on (run value 1)
_on = run_values == 1
To get the data from your pandas DataFrame:
for _start, _run_length in zip(run_starts[_on], run_lengths[_on]):
    tmp_df = df.iloc[_start:_start + _run_length]
    # do what you want with each "on" period here
    print(tmp_df)
    break  # remove this... I just don't know how big your data is
You're correct if you are thinking "this answer does not run as-is". But it should give you enough to get going and solve your problem.
An incomplete question gets an incomplete response.
TODO: make your Machine On/Off column an integer. I'm not sure it is absolutely needed, but if it's a numerical numpy array there are more options available to you:
df['Machine On/Off'] = [1 if x >= 115 else 0 for x in df['Rolling Average Center'] ]
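For the larger goal (the average pressure of every period where the machine is on for more than 3 rows), a more pandas-native sketch, using the column names from the question and assuming the integer Machine On/Off column above, could look like this:
on = df['Machine On/Off'] == 1
run_id = (on != on.shift()).cumsum()      # label each consecutive run of identical on/off values
on_runs = df[on].groupby(run_id[on])      # group the rows of each "on" run together
summary = on_runs['Pressure'].agg(['size', 'mean'])
summary = summary[summary['size'] > 3]    # keep only periods that are on for more than 3 rows
print(len(summary))                       # number of qualifying on-periods
print(summary['mean'])                    # average pressure of each of those periods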

How do you use a list as an index argument for numpy ndarrays?

So I have a problem that might be super duper simple.
I have these numpy ndarrays that I allocated and want to assign values to them via indices returned as lists. It might be easier if I showed you some example code. The questionable code I have is at the bottom, and in my testing (before actually taking this to scale) I keep getting syntax errors :'(
EDIT: edited to make it easier to troubleshoot and put some example code at the bottom
import numpy as np

def do_stuff(index, mask):
    # this is where the calculations are made
    magic = sum(mask)
    return index, magic

def foo(full_index, comparison_dims, *xargs):
    # I have this function executed in Parallel since I'm using a machine with 36 nodes per core,
    # and can access up to 16 cores for each script #blessed
    # figure out how many dimensions there are, and how big they are
    parent_dims = []
    parent_diffs = []
    for j in xargs:
        parent_dims += [len(j)]
        parent_diffs += [j[1] - j[0]]  # this is used to find a mask
    index = []  # this is where the individual dimension indices will be stored
    dim_n = 0
    # loop through the dimensions
    while dim_n < len(parent_dims):
        dim_index = full_index % parent_dims[dim_n]
        index += [dim_index]
        if dim_n == 0:
            mask = (comparison_dims[dim_n] > xargs[dim_n][dim_index] - parent_diffs[dim_n] / 2) * \
                   (comparison_dims[dim_n] <= xargs[dim_n][dim_index] + parent_diffs[dim_n] / 2)
        else:
            mask *= (comparison_dims[dim_n] > xargs[dim_n][dim_index] - parent_diffs[dim_n] / 2) * \
                    (comparison_dims[dim_n] <= xargs[dim_n][dim_index] + parent_diffs[dim_n] / 2)
        full_index //= parent_dims[dim_n]
        dim_n += 1
    return do_stuff(index, mask)

def bar(comparison_dims, *xargs):
    if len(xargs) == comparison_dims.shape[0]:
        pass
    elif len(comparison_dims.shape) == 2:
        pass
    else:
        raise ValueError("silly person, you failed")
    from joblib import Parallel, delayed
    dims = []
    for j in xargs:
        dims += [len(j)]
    myArray = np.empty(tuple(dims))
    results = Parallel(n_jobs=1)(
        delayed(foo)(index, comparison_dims, *xargs)
        for index in range(np.prod(dims))
    )
    # LOOK HERE, HELP HERE!
    for index_list, result in results:
        # I thought this would work, but oh golly was I wrong; index_list here is a list of ints, and result is a value
        # for example index, result = [0,3,7], 45.4
        # so in execution, that would yield: myArray[0,3,7] = 45.4
        # instead it yields SyntaxError because I don't know what I'm doing XD
        myArray[*index_list] = result
    return myArray
Any ideas how I can make that work? What do I need to do?
I'm not the sharpest tool in the shed, but I think with your help we might be able to figure this out!
A quick example to troubleshoot this problem would be:
compareDims = np.array([np.random.rand(1000), np.random.rand(1000)])
dim0 = np.arange(0,1,1./20)
dim1 = np.arange(0,1,1./30)
myArray = bar(compareDims, dim0, dim1)
To index a numpy array with an arbitrary list of multidimensional indices, you actually need to use a tuple:
for index_list, result in results:
    myArray[tuple(index_list)] = result
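A minimal standalone sketch of the same idea (hypothetical shape and values, just to show the tuple-indexing behaviour):
import numpy as np

arr = np.zeros((4, 5, 8))
idx = [0, 3, 7]          # one index per dimension, as a plain list
arr[tuple(idx)] = 45.4   # equivalent to arr[0, 3, 7] = 45.4
print(arr[0, 3, 7])      # 45.4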

Python code not working as intended

I started learning Python < 2 weeks ago.
I'm trying to make a function to compute a 7-day moving average for some data. Something wasn't going right, so I tried it without the function.
moving_average = np.array([])
i = 0
for i in range(len(temp)-6):
    sum_7 = np.array([])
    avg_7 = 0
    missing = 0
    total = 7
    j = 0
    for j in range(i,i+7):
        if pd.isnull(temp[j]):
            total -= 1
            missing += 1
            if missing == 7:
                moving_average = np.append(moving_average, np.nan)
                break
    if not pd.isnull(temp[j]):
        sum_7 = np.append(sum_7, temp[j])
        if j == (i+6):
            avg_7 = sum(sum_7)/total
            moving_average = np.append(moving_average, avg_7)
If I run this and look at the value of sum_7, it's just a single value in the numpy array, which makes all the moving_average values wrong. But if I remove the first for loop with the variable i, manually set i = 0 (or any number in the range of the data set), and run the exact same code from the inner for loop, sum_7 comes out as a length-7 numpy array. Originally I just did sum += temp[j], but the same problem occurred: the total sum ended up as just a single value.
I've been staring at this trying to fix it for 3 hours and I'm clueless about what's wrong. I originally wrote the function in R, so all I had to do was convert it to Python, and I don't know why sum_7 comes out as a single value when there are two for loops. I tried to manually add an index variable to act as i and use it in range(i, i+7), but got some weird error instead. I also don't know why that is.
https://gyazo.com/d900d1d7917074f336567b971c8a5cee
https://gyazo.com/132733df8bbdaf2847944d1be02e57d2
Hey, you can use the rolling() and mean() functions from pandas.
Link to the documentation:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.rolling.html
df['moving_avg'] = df['your_column'].rolling(7).mean()
This will also give you some NaN values, but that is part of the rolling mean, because you don't have all 7 past data points for the first 6 values.
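If you would rather not have those leading NaNs, rolling also accepts a min_periods argument (a small sketch of my own, not part of the original answer):
# average over however many of the 7 points are available so far
df['moving_avg'] = df['your_column'].rolling(7, min_periods=1).mean()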
Seems like you misindented the important line:
moving_average = np.array([])
i = 0
for i in range(len(temp)-6):
    sum_7 = np.array([])
    avg_7 = 0
    missing = 0
    total = 7
    j = 0
    for j in range(i,i+7):
        if pd.isnull(temp[j]):
            total -= 1
            missing += 1
            if missing == 7:
                moving_average = np.append(moving_average, np.nan)
                break
    # The following condition should be indented one more level,
    # so that it runs inside the inner loop for every j
    if not pd.isnull(temp[j]):
        sum_7 = np.append(sum_7, temp[j])
    #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        if j == (i+6):
        # this ^ condition does not do what you meant
        # you should use a flag instead
            avg_7 = sum(sum_7)/total
            moving_average = np.append(moving_average, avg_7)
Instead of a flag you can use a for-else construct, but that is less readable. Here's the relevant documentation.
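For completeness, a rough sketch of that for-else variant (the else block runs only when the inner loop finishes without hitting break):
moving_average = np.array([])
for i in range(len(temp) - 6):
    sum_7 = np.array([])
    total = 7
    for j in range(i, i + 7):
        if pd.isnull(temp[j]):
            total -= 1
            if total == 0:
                moving_average = np.append(moving_average, np.nan)
                break  # all 7 values missing; skip the else clause below
        else:
            sum_7 = np.append(sum_7, temp[j])
    else:
        # reached only when the inner loop was not broken out of
        moving_average = np.append(moving_average, sum(sum_7) / total)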
Shorter way to do this:
moving_average = np.array([])
for i in range(len(temp)-6):
    ngram_7 = [t for t in temp[i:i+7] if not pd.isnull(t)]
    average = (sum(ngram_7) / len(ngram_7)) if ngram_7 else np.nan
    moving_average = np.append(moving_average, average)
This could be refactored further:
def average(ngram):
    valid = [t for t in ngram if not pd.isnull(t)]
    if not valid:
        return np.nan
    return sum(valid) / len(valid)

def ngrams(seq, n):
    # yields every window of length n (len(seq) - n + 1 windows in total)
    for i in range(len(seq) - n + 1):
        yield seq[i:i+n]

moving_average = [average(k) for k in ngrams(temp, 7)]

Getting the row and column numbers that meets multiple conditions in Pandas

I am trying to get the row and column number, which meets three conditions in Pandas DataFrame.
I have a DataFrame of 0, 1, and -1 values (bigger than 1850 rows); when I try to get the row and column numbers, it takes forever to get the output.
The following is an example I have been trying to use:
import pandas as pd
import numpy as np
a = pd.DataFrame(np.random.randint(2, size=(1845,1850)))
b = pd.DataFrame(np.random.randint(2, size=(5,1850)))
b[b == 1] = -1
c = pd.concat([a,b], ignore_index=True)
column_positive = []
row_positive = []
column_negative = []
row_negative = []
column_zero = []
row_zero = []
for column in range(0, c.shape[0]):
    for row in range(0, c.shape[1]):
        if c.iloc[column, row] == 1:
            column_positive.append(column)
            row_positive.append(row)
        elif c.iloc[column, row] == -1:
            column_negative.append(column)
            row_negative.append(row)
        else:
            column_zero.append(column)
            row_zero.append(row)
I did some web searching and found that np.where() does something like this, but I have no idea how to use it.
Could anyone suggest a better alternative?
You are right, np.where would be one way to do it. Here's an implementation with it -
# Extract the values from c into an array for ease in further processing
c_arr = c.values
# Use np.where to get row and column indices corresponding to three comparisons
column_zero, row_zero = np.where(c_arr==0)
column_negative, row_negative = np.where(c_arr==-1)
column_positive, row_positive = np.where(c_arr==1)
If you don't mind having rows and columns as a Nx2 shaped array, you could do it in a bit more concise manner, like so -
neg_idx, zero_idx, pos_idx = [np.argwhere(c_arr == item) for item in [-1,0,1]]
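A short follow-up sketch (my addition, not from the answer) showing how to split one of those Nx2 index arrays back into separate lists, if you still need them in that form:
pos_idx = np.argwhere(c_arr == 1)   # (N, 2) array of [axis-0, axis-1] index pairs
column_positive = pos_idx[:, 0]     # indices along axis 0 (called "column" in the question's loop)
row_positive = pos_idx[:, 1]        # indices along axis 1 (called "row" in the question's loop)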

Numpy for loop gives a different result each time

First time posting here, so here it goes:
I have two sets of data (v and t); each one has 46 values. The data is imported with the pandas module and converted to a numpy array in order to do the calculation.
I need to set ml_min1[45], ml_min2[45], and so on to the value 0. The problem is that each time I run the script, the values corresponding to position 45 of ml_min1 and ml_min2 are different. This is the piece of code that I have:
t1 = fil_copy.t1.as_matrix()
t2 = fil_copy.t2.as_matrix()
v1 = fil_copy.v1.as_matrix()
v2 = fil_copy.v2.as_matrix()
ml_min1 = np.empty(len(t1))
l_h1 = np.empty(len(t1))
ml_min2 = np.empty(len(t2))
l_h2 = np.empty(len(t2))

for i in range(0, (len(v1) - 1)):
    if (i != (len(v1) - 1)) and (v1[i+1] > v1[i]):
        ml_min1[i] = v1[i+1] - v1[i]
        l_h1[i] = ml_min1[i] * (60/1000)
    elif i == (len(v1)-1):
        ml_min1[i] = 0
        l_h1[i] = 0
        print(i, ml_min1[i])
    else:
        ml_min1[i] = 0
        l_h1[i] = 0
        print(i, ml_min1[i])

for i in range(0, (len(v2) - 1)):
    if (i != (len(v2) - 1)) and (v2[i+1] > v2[i]):
        ml_min2[i] = v2[i+1] - v2[i]
        l_h2[i] = ml_min2[i] * (60/1000)
    elif i == (len(v2)-1):
        ml_min2[i] = 0
        l_h2[i] = 0
        print(i, ml_min2[i])
    else:
        ml_min2[i] = 0
        l_h2[i] = 0
        print(i, ml_min2[i])
Your code as currently written doesn't work because the elif blocks are never hit: range(0, x) does not include x (it stops just before getting there), so the last index is never assigned, and numpy.empty leaves it as whatever happened to be in memory, which is why it differs on every run. The easiest way to solve this is probably just to initialize your output arrays with numpy.zeros rather than numpy.empty; then you don't need to do anything in the elif and else blocks (you can just delete them).
That said, it's generally a design error to use loops like yours in numpy code. Instead, you should use numpy's broadcasting features to perform your mathematical operations to a whole array (or a slice of one) at once.
If I understand correctly, the following should be equivalent to what you wanted your code to do (just for one of the arrays, the other should work the same):
ml_min1 = np.zeros(len(t1)) # use zeros rather than empty, so we don't need to assign any 0s
diff = v1[1:] - v1[:-1] # find the differences between all adjacent values (using slices)
mask = diff > 0 # check which ones are positive (creates a Boolean array)
ml_min1[:-1][mask] = diff[mask] # assign with mask to a slice of the ml_min1 array
l_h1 = ml_min1 * (60/1000) # create l_h1 array with a broadcast scalar multiplication
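A tiny self-contained check of that broadcasting approach (hypothetical v1 values, just to illustrate the expected output):
import numpy as np

v1 = np.array([10.0, 12.5, 12.0, 15.0])   # hypothetical volume readings
ml_min1 = np.zeros(len(v1))
diff = np.diff(v1)                         # same as v1[1:] - v1[:-1]
mask = diff > 0
ml_min1[:-1][mask] = diff[mask]
l_h1 = ml_min1 * (60 / 1000)
print(ml_min1)   # [2.5 0.  3.  0. ]  -> the last entry stays 0, as required
print(l_h1)      # [0.15 0.   0.18 0.  ]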
