Getting the row and column numbers that meet multiple conditions in Pandas - python

I am trying to get the row and column numbers that meet three conditions in a Pandas DataFrame.
I have a large DataFrame (more than 1850 rows and columns) of 0, 1, and -1 values; when I try to get the row and column indices, it takes forever to produce the output.
The following is an example of what I have been trying:
import pandas as pd
import numpy as np

a = pd.DataFrame(np.random.randint(2, size=(1845, 1850)))
b = pd.DataFrame(np.random.randint(2, size=(5, 1850)))
b[b == 1] = -1
c = pd.concat([a, b], ignore_index=True)

column_positive = []
row_positive = []
column_negative = []
row_negative = []
column_zero = []
row_zero = []

for column in range(0, c.shape[0]):
    for row in range(0, c.shape[1]):
        if c.iloc[column, row] == 1:
            column_positive.append(column)
            row_positive.append(row)
        elif c.iloc[column, row] == -1:
            column_negative.append(column)
            row_negative.append(row)
        else:
            column_zero.append(column)
            row_zero.append(row)
I did some web searching and found that np.where() does something like this, but I have no idea how to use it.
Could anyone suggest a better alternative?

You are right, np.where would be one way to do it. Here's an implementation with it -
# Extract the values from c into an array for ease in further processing
c_arr = c.values
# Use np.where to get row and column indices corresponding to three comparisons
column_zero, row_zero = np.where(c_arr==0)
column_negative, row_negative = np.where(c_arr==-1)
column_positive, row_positive = np.where(c_arr==1)
If you don't mind having rows and columns as an Nx2 shaped array, you could do it in a more concise manner, like so -
neg_idx, zero_idx, pos_idx = [np.argwhere(c_arr == item) for item in [-1,0,1]]
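For reference, here is what np.argwhere returns on a tiny made-up frame (each row of the result is a (row, column) pair):

import numpy as np
import pandas as pd

c = pd.DataFrame([[1, -1], [0, 1]])
neg_idx, zero_idx, pos_idx = [np.argwhere(c.values == item) for item in [-1, 0, 1]]
print(pos_idx)
# [[0 0]
#  [1 1]]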

Related

Remove following rows that are above or below the current row['x'] by X amount

I am calculating correlations, and the DataFrame I have needs to be filtered.
I am looking to remove the rows below the current row whose values are within X of it, starting with the first row and looping through the DataFrame all the way to the last row.
Example:
df['y'] has the values 50,51,52,53,54,55,70,71,72,73,74,75
If X = 10, it would start at 50, see 51,52,53,54,55 as within that +-10 range, and delete those rows. 70 would stay, as it is not within that range, and the same test would start again at 70, where 71,72,73,74,75 and their respective rows would be deleted.
The filter with X=10 would thus leave us with the rows containing 50 and 70 in df.
This would leave me with a clean DataFrame that drops the instances linked to the first instance of what is essentially the same observed period. I tried coding a loop to do that, but I am left with the wrong result and am desperate at this point. Hopefully someone can correct the mistake or point me in the right direction.
df6['index'] = df6.index
df6.sort_values('index')
boom = len(dataframe1.index)/3
# Taking initial comparison values from first row
c = df6.iloc[0]['index']
# Including first row in result
filters = [True]
# Skipping first row in comparisons
for index, row in df6.iloc[1:].iterrows():
    if c - boom <= row['index'] <= c + boom:
        filters.append(False)
    else:
        filters.append(True)
        # Updating values to compare based on latest accepted row
        c = row['index']
df2 = df6.loc[filters].sort_values('correlation').drop('index', 1)
df2
[screenshots: output before and output after]
IIUC, your main issue is to filter out consecutive values within a threshold.
You can use a custom function that acts on a Series (a column) and returns the list of valid indices:
def consecutive(s, threshold=10):
    prev = float('-inf')
    idx = []
    for i, val in s.items():  # iteritems() is deprecated in recent pandas
        if val - prev > threshold:
            idx.append(i)
            prev = val
    return idx
Example of use:
import pandas as pd
df = pd.DataFrame({'y': [50,51,52,53,54,55,70,71,72,73,74,75]})
df2 = df.loc[consecutive(df['y'])]
Output:
y
0 50
6 70
Variant
If you prefer the function to return a boolean indexer, here is a variant:
def consecutive(s, threshold=10):
    prev = float('-inf')
    mask = [False] * len(s)
    for pos, val in enumerate(s):  # positional, so it works with any index
        if val - prev > threshold:
            mask[pos] = True
            prev = val
    return mask
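This variant is used the same way, just with plain boolean indexing (assuming the same df as above):

df2 = df[consecutive(df['y'])]
# same output:
#    y
# 0  50
# 6  70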

How to slice a pandas DataFrame with a for loop over another list?

I have a DataFrame like this, but there will not always be just 3 site columns:
import pandas as pd

data = [[501620, 501441, 501549], [501832, 501441, 501549], [528595, 501662, 501549], [501905, 501441, 501956], [501913, 501441, 501549]]
df = pd.DataFrame(data, columns=["site_0", "site_1", "site_2"])
I want to slice the DataFrame with conditions built dynamically from the elements of li (a list) and a random combination.
I have tried the code below, which is static:
li = [1,2]
random_com = (501620, 501441,501549)
df_ = df[(df["site_"+str(li[0])] == random_com[li[0]]) & \
(df["site_"+str(li[1])] == random_com[li[1]])]
How can I make the above code dynamic?
I have tried this, but it gives me two different DataFrames, whereas I need one DataFrame with both conditions combined (AND).
[df[(df["site_"+str(j)] == random_com[j])] for j in li]
You can iterate over the conditions and combine them all with &.
li = [1, 2]
main_mask = True
for i in li:
    main_mask = main_mask & (df["site_" + str(i)] == random_com[i])
df_ = df[main_mask]
If you prefer a one-liner, I think you could use functools.reduce():
from functools import reduce

df_ = df[reduce(lambda x, y: x & y, [(df["site_" + str(j)] == random_com[j]) for j in li])]
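If you would rather avoid functools, an equivalent sketch (assuming the same df, li, and random_com as above) collects the boolean Series into a frame and reduces with all():

conditions = [df["site_" + str(j)] == random_com[j] for j in li]
df_ = df[pd.concat(conditions, axis=1).all(axis=1)]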

Subset a row based on columns with similar names

Assuming a pandas DataFrame like the one in the picture, I would like to fill the NA values with the values of the other variables similar to them. To be more clear, my variables are
mean_1, mean_2, ..., std_1, std_2, ..., min_1, min_2, ...
So I would like to fill the NA values with the values of the other columns, but not all the columns, only those that represent the same metric. In the picture I highlighted 2 NA values. The first one I would like to fill with the mean obtained from the 'MEAN' variables at row 2, while the second NA I would like to fill with the mean obtained from the 'MIN' variables at row 9. Is there a way to do it?
You can find the unique prefixes, iterate through each, and do fillna for each subset separately:
uniq_prefixes = set([x.split('_')[0] for x in df.columns])
for prfx in uniq_prefixes:
    mask = [col for col in df if col.startswith(prfx)]
    # Transpose is needed because row-wise fillna is not implemented yet
    df.loc[:, mask] = df[mask].T.fillna(df[mask].mean(axis=1)).T
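As a quick sanity check, here is a small made-up frame (column names assumed to follow the prefix_N pattern):

import numpy as np
import pandas as pd

df = pd.DataFrame({'mean_1': [1.0, np.nan],
                   'mean_2': [3.0, 4.0],
                   'min_1': [np.nan, 0.5],
                   'min_2': [2.0, 1.5]})
# After running the loop above, the NaN in mean_1 (row 1) becomes 4.0
# (the row mean of the remaining mean_ columns) and the NaN in min_1
# (row 0) becomes 2.0.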
Yes, it is possible using a loop. Below is a naive approach; fancier ones would not offer much more optimisation (at least none that I see).
for i, row in df.iterrows():
    sum_means = 0
    n_means = 0
    sum_stds = 0
    n_stds = 0
    fill_mean_idxs = []
    fill_std_idxs = []
    for idx, item in row.items():
        if idx.startswith('mean') and pd.isna(item):
            fill_mean_idxs.append(idx)
        elif idx.startswith('mean'):
            sum_means += float(item)
            n_means += 1
        elif idx.startswith('std') and pd.isna(item):
            fill_std_idxs.append(idx)
        elif idx.startswith('std'):
            sum_stds += float(item)
            n_stds += 1
    ave_mean = sum_means / n_means
    ave_std = sum_stds / n_stds
    for idx in fill_mean_idxs:
        df.loc[i, idx] = ave_mean
    for idx in fill_std_idxs:
        df.loc[i, idx] = ave_std

How to filter and append arrays

I have some code that I am using to try to filter out rows that are missing values, as seen here:
from astropy.table import Table
import numpy as np

data = '/home/myname/data.fits'
data = Table.read(data, format="fits")
ID = np.array(data['id']).astype(str)
redshift = np.array(data['z']).astype(float)
radius = np.array(data['r']).astype(float)
mag = np.array(data['M']).astype(float)
def stack(array1, array2, array3, array4):
    # stacks multiple arrays to have corresponding values next to each other
    stacked_array = [(array1[i], array2[i], array3[i], array4[i]) for i in range(0, array1.size)]
    stacked_array = np.array(stacked_array)
    return stacked_array

stacked = stack(ID, redshift, radius, mag)
filtered_array = np.array([])
for i in stacked:
    if not i.any == 'nan':
        np.insert(filtered_array, i[0], axis=0)
The last for loop is where I'm having difficulty. I want to insert a row from my stacked array into my filtered array only if it has all of the information (some rows are missing redshift, others are missing magnitude, etc.). How would I be able to loop over my stacked array and filter out all of the rows that are missing any of the 4 values I want? I currently keep getting this error:
TypeError: _insert_dispatcher() missing 1 required positional argument: 'values'
So something like this?
a=[[1,2,3,4],[1,"nan",2,3]]
b=[i for i in a if not any(j=='nan' for j in i)]
which prints [[1, 2, 3, 4]].
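Applied to your arrays, that same comprehension would be (a sketch assuming the missing values appear as the string 'nan' after stacking):

filtered_array = np.array([i for i in stacked if not any(j == 'nan' for j in i)])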
You can switch:
for i in stacked:
    if not i.any == 'nan':
        np.insert(filtered_array, i[0], axis=0)
to:
def any_is_nan(col):
    return len(list(filter(lambda x: x == 'nan', col))) > 0

filtered_array = list(filter(lambda x: not any_is_nan(x), stacked))
Please refer to filter.
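As a side note, since stacked is already a NumPy array, you can also build the mask in a vectorized way without any Python-level loop (again assuming the missing values show up as the string 'nan'):

# keep only the rows in which no element equals 'nan'
mask = ~np.any(stacked == 'nan', axis=1)
filtered_array = stacked[mask]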

Iterating over numpy arrays to compare columns of different arrays

I am new to programming and had a question. If I had two numpy arrays:
A = np.array([[1,0,3], [2,6,5], [3,4,1],[4,3,2],[5,7,9]], dtype=np.int64)
B = np.array([[3,4,5],[6,7,9],[1,0,3],[4,5,6]], dtype=np.int64)
I want to compare the last two columns of array A to the last two columns of array B, and then if they are equal, output the entire row to a new array. So, the output of these two arrays would be:
[[1,0,3],
 [1,0,3],
 [5,7,9],
 [6,7,9]]
Because even though the first element does not match for the last two rows, the last two elements do.
Here is my code so far, but it is not even close to working. Can anyone give me some tips?
column_two_A = A[:,1]
column_two_B = B[:,1]
column_three_A = A[:,2]
column_three_B = B[:,2]
column_four_A = A[:,3]
column_four_B = B[:,3]
times = A[:,0]
for elementA in column_three_A:
    for elementB in column_three_B:
        if elementA == elementB:
            continue
        for elementC in column_two_A:
            for elementD in column_two_B:
                if elementC == elementD:
                    continue
                for elementE in column_four_A:
                    for elementF in column_four_B:
                        if elementE == elementF:
                            continue
                        element.append(time)
print(element)
NumPy has many functions for this kind of task. Here is a solution that checks whether the values of A are in B. Add print() statements and check what chk, chk2, and x are.
import numpy as np

A = np.array([[1,0,3], [2,6,5], [3,4,1], [4,3,2], [5,7,9]], dtype=np.int64)
B = np.array([[3,4,5], [6,7,9], [1,0,3], [4,5,6]], dtype=np.int64)

c = []
for k in A:
    chk = np.equal(k[-2:], B[:, -2:])
    chk2 = np.all(chk, axis=1)
    x = B[chk2, :]
    if x.size:
        c.append(x)
print(c)
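For larger arrays, the same comparison can be vectorized with broadcasting; here is a sketch that produces each matching A row followed by its matching B row:

# pairwise-compare the last two columns of every A row against every B row
match = (A[:, None, -2:] == B[None, :, -2:]).all(axis=2)
a_idx, b_idx = np.where(match)
# interleave each matching A row with its matching B row
pairs = np.stack([A[a_idx], B[b_idx]], axis=1).reshape(-1, A.shape[1])
print(pairs)
# [[1 0 3]
#  [1 0 3]
#  [5 7 9]
#  [6 7 9]]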
I think I figured it out by staying up all night... thank you!
for i in range(len(A)):
    for j in range(len(B)):
        if A[i][1] == B[j][1]:
            if A[i][2] == B[j][2]:
                print(B[j])
                print(A[i])
