I have a boolean matrix of M x N, where M = 6000 and N = 1000
1 | 0 1 0 0 0 1 ----> 1000
2 | 1 0 1 0 1 0 ----> 1000
3 | 0 0 1 1 0 0 ----> 1000
V
6000
Now for each column, I want to find the first occurrence where the value is 1. For the example above, across the six columns shown, I want 2 1 2 3 2 1.
Now the code I have is
sig_matrix = list()
num_columns = df.columns
for col_name in num_columns:
    print('Processing column {}'.format(col_name))
    sig_index = df.filter(df[col_name] == 1).\
        select('perm').limit(1).collect()[0]['perm']
    sig_matrix.append(sig_index)
Now the above code is really slow: it takes 5~7 minutes to process 1000 columns. Is there a faster way to do this than what I am doing? I am also willing to use a pandas DataFrame instead of a PySpark DataFrame if that is faster.
Here is a numpy version that runs in under a second for me, so it should be preferable for data of this size:
import numpy as np

arr = np.random.choice([0, 1], size=(6000, 1000))
[np.argwhere(arr[:, i] == 1)[0][0] for i in range(1000)]
There could well be more efficient numpy solutions.
I ended up solving my problem using numpy. Here is how I did it.
import numpy as np
sig_matrix = list()
columns = list(df)
for col_name in columns:
    sig_index = np.argmax(df[col_name]) + 1
    sig_matrix.append(sig_index)
As the values in my columns are 0 and 1, argmax will return the first occurrence of value 1.
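The per-column loop can also be avoided entirely. A minimal sketch, assuming the data is already a numpy array (or converted with df.to_numpy()) and that every column contains at least one 1:

import numpy as np

arr = df.to_numpy()                      # shape (6000, 1000), 0/1 values
sig_matrix = np.argmax(arr, axis=0) + 1  # 1-based row index of the first 1 in each column

Note that argmax returns 0 for an all-zero column, which would look the same as a 1 in the first row, so this is only meaningful when each column has at least one 1.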
I have a dataframe of ints:
mydf = pd.DataFrame([[0,0,0,1,0,2,2,5,2,4],
                     [0,1,0,0,2,2,4,5,3,3],
                     [1,1,1,1,2,2,0,4,4,4]])
I'd like to calculate something that resembles the gradient given by pd.Series.diff() for each row, but with one big change: my ints represent categorical data, so I'm only interested in detecting a change, not its magnitude. The step from 0 to 1 should count the same as the step from 0 to 4.
Is there a way for pandas to interpret my data as categorical in the data frame, and then calculate a Series.diff() on that? Or could you "flatten" the output of Series.diff() to be only 0s and 1s?
If I understand you correctly, this is what you are trying to achieve:
import pandas as pd
mydf = pd.DataFrame([[0,0,0,1,0,2,2,5,2,4],
                     [0,1,0,0,2,2,4,5,3,3],
                     [1,1,1,1,2,2,0,4,4,4]])
mydf = mydf.astype("category")
diff_df = mydf.apply(lambda x: x.diff().ne(0), axis=1).astype(int)
The ne(0) call returns a boolean frame indicating whether the difference between consecutive values is non-zero (the first entry of each row is NaN after diff(), which also compares as non-zero and therefore becomes 1). astype then converts the booleans to integers (0s and 1s). The result is a dataframe with the same number of rows and columns as the original, but with binary values indicating a change in the categorical value from one step to the next.
0 1 2 3 4 5 6 7 8 9
0 1 0 0 1 1 1 0 1 1 1
1 1 1 1 0 1 0 1 1 1 0
2 1 0 0 0 1 0 1 1 0 0
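For larger frames, a possibly faster variant avoids apply by diffing across columns directly. A sketch, applied to the original integer frame (the category conversion is not needed for this):

diff_df = mydf.diff(axis=1).ne(0).astype(int)

This produces the same 0/1 pattern as above, with the first column always 1 because there is nothing to its left to compare against.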
For example, I have the following:
   a  b  benchmark
0  1  2          1
1  1  5          3
and I would like to apply a condition in Pandas for each column as:
def f(x):
    if x > benchmark:
        # x being the value of a or b
        return x
    else:
        return 0
But I don't know how to do that. If I use df.apply(f), I can't access the other cells in the row, as x is just the value of one cell.
I don't want to create a new column either. I want to directly change the value of each cell as I compare it to benchmark, clearing or zeroing the cells that do not meet the benchmark.
Any insight?
You don't need a function; instead use vectorized operations:
out = df.where(df.gt(df['benchmark'], axis=0), 0)
To change the values in place:
df[df.le(df['benchmark'], axis=0)] = 0
Output:
a b benchmark
0 0 2 0
1 0 5 0
If you don't want to affect benchmark:
m = df.le(df['benchmark'], axis=0)
m['benchmark'] = False
df[m] = 0
Output:
a b benchmark
0 0 2 1
1 0 5 3
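For reference, a minimal end-to-end sketch of the last variant, assuming the example frame shown above:

import pandas as pd

df = pd.DataFrame({'a': [1, 1], 'b': [2, 5], 'benchmark': [1, 3]})

m = df.le(df['benchmark'], axis=0)  # True where a value is <= its row's benchmark
m['benchmark'] = False              # leave the benchmark column untouched
df[m] = 0
print(df)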
I am looking for an efficient way to merge two pandas data frames based on a function that takes as input columns from both data frames and returns True or False. E.g. Assume I have the following "tables":
import pandas as pd
df_1 = pd.DataFrame(data=[1, 2, 3])
df_2 = pd.DataFrame(data=[4, 5, 6])
def validation(a, b):
    return ((a + b) % 2) == 0
I would like to join df_1 and df_2 on each pair of rows where the sum of the values in the first columns is an even number. The resulting table would be
        1 5
df_3 =  2 4
        2 6
        3 5
Please think of it as a general problem, not as a task to return just df_3. The solution should accept any function that validates a combination of columns and returns True or False.
THX Lazloo
You can do this with a merge on parity:
(df_1.assign(parity=df_1[0] % 2)
     .merge(df_2.assign(parity=df_2[0] % 2), on='parity')
     .drop('parity', axis=1)
)
output:
0_x 0_y
0 1 5
1 3 5
2 2 4
3 2 6
You can use broadcasting, or the outer functions, to compare all rows. You'll run into memory issues as the lengths grow, since this materializes every pairwise combination.
import pandas as pd
import numpy as np
def validation(a, b):
    """a, b : np.array"""
    arr = np.add.outer(a, b)       # How to combine rows
    i, j = np.where(arr % 2 == 0)  # Condition
    return pd.DataFrame(np.stack([a[i], b[j]], axis=1))
validation(df_1[0].to_numpy(), df_2[0].to_numpy())
0 1
0 1 5
1 2 4
2 2 6
3 3 5
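For the fully general case (an arbitrary validation function rather than the parity trick), one option is a full cross join followed by a filter. A sketch, assuming both frames fit in memory once crossed and that pandas >= 1.2 is available for how='cross':

import pandas as pd

df_1 = pd.DataFrame(data=[1, 2, 3])
df_2 = pd.DataFrame(data=[4, 5, 6])

def validation(a, b):
    return ((a + b) % 2) == 0

# Rename to avoid ambiguous suffixes, cross join, then keep the pairs that validate.
left = df_1.rename(columns={0: 'left'})
right = df_2.rename(columns={0: 'right'})
crossed = left.merge(right, how='cross')
df_3 = crossed[validation(crossed['left'], crossed['right'])]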
In this particular case you can leverage the fact that a sum is even exactly when the two numbers have the same parity, so define a parity column and merge on that.
df_1['parity'] = df_1[0]%2
df_2['parity'] = df_2[0]%2
df_3 = df_1.merge(df_2, on='parity')
0_x parity 0_y
0 1 1 5
1 3 1 5
2 2 0 4
3 2 0 6
This is a basic solution, but it is not very efficient if you are working on large dataframes:
df_1.index *= 0
df_2.index *= 0
df = df_1.join(df_2, lsuffix='_2')
df = df[df.sum(axis=1) % 2 == 0]
Edit: here is a better solution
df_1.index = df_1.iloc[:,0] % 2
df_2.index = df_2.iloc[:,0] % 2
df = df_1.join(df_2, lsuffix='_2')
I am looking for a conditional statement in Python to look for certain information in a specified column and put the result in a new column.
Here is an example of my dataset:
OBJECTID CODE_LITH
1 M4,BO
2 M4,BO
3 M4,BO
4 M1,HP-M7,HP-M1
and what I want as results:
OBJECTID CODE_LITH M4 M1
1 M4,BO 1 0
2 M4,BO 1 0
3 M4,BO 1 0
4 M1,HP-M7,HP-M1 0 1
What I have done so far:
import pandas as pd
import numpy as np
lookup = ['M4']
df.loc[df['CODE_LITH'].str.isin(lookup),'M4'] = 1
df.loc[~df['CODE_LITH'].str.isin(lookup),'M4'] = 0
Since there are multiple values per row in "CODE_LITH", it seems the script is not able to match only "M4"; it can only match the full string "M4,BO", and it puts 1 or 0 in the new column based on that.
I have also tried:
if ('M4') in df['CODE_LITH']:
    df['M4'] = 0
else:
    df['M4'] = 1
With the same results.
Thanks for your help.
PS. The dataframe contains about 2.6 million rows and I need to do this operation for 30-50 variables.
I think this is the Pythonic way to do it:
for mn in ['M1', 'M4']:  # Add other "M#" as needed
    df[mn] = df['CODE_LITH'].map(lambda x: mn in x)
Use the str.contains accessor:
>>> for key in ('M4', 'M1'):
...     df.loc[:, key] = df['CODE_LITH'].str.contains(key).astype(int)
>>> df
OBJECTID CODE_LITH M4 M1
0 1 M4,BO 1 0
1 2 M4,BO 1 0
2 3 M4,BO 1 0
3 4 M1,HP-M7,HP-M1 0 1
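Since the codes here are literal substrings rather than regular expressions, passing regex=False to str.contains may speed this up noticeably on 2.6 million rows (a suggestion rather than a benchmarked result):

for key in ('M4', 'M1'):
    df[key] = df['CODE_LITH'].str.contains(key, regex=False).astype(int)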
I was able to do:
for index, data in enumerate(df['CODE_LITH']):
    if "I1" in data:
        df['Plut_Felsic'][index] = 1
    else:
        df['Plut_Felsic'][index] = 0
It does work, but takes quite some time to calculate.
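A vectorized equivalent of this loop (which also avoids the chained indexing in df['Plut_Felsic'][index], something pandas warns about) could look like the sketch below, assuming CODE_LITH holds plain strings:

df['Plut_Felsic'] = df['CODE_LITH'].str.contains('I1', regex=False).astype(int)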
In Pandas, I am trying to manually code a chi-square test. I am comparing row 0 with row 1 in the dataframe below.
data
2 3 5 10 30
0 3 0 6 5 0
1 33324 15833 58305 54402 38920
For this, I need to calculate the expected count for each cell as: cell(i,j) = rowSum(i) * colSum(j) / sumAll. In R, I can do this simply by taking an outer() product:
Exp_counts <- outer(rowSums(data), colSums(data), "*")/sum(data) # Expected cell counts
I used numpy's outer product function to imitate the outcome of the above R code:
import numpy as np
import pandas as pd

pd.DataFrame(np.outer(data.sum(axis=1), data.sum(axis=0)) / data.sum().sum(),
             index=data.index, columns=data.columns.values)
2 3 5 10 30
0 2 1 4 3 2
1 33324 15831 58306 54403 38917
Is it possible to achieve this with a Pandas function?
A complete solution using only Pandas built-in methods:
def outer_product(row):
    numerator = df.sum(1).mul(row.sum(0))
    denominator = df.sum(0).sum(0)
    return numerator.floordiv(denominator)
df.apply(outer_product)
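For illustration, a self-contained sketch of the call on the sample data above (the frame is named df here to match the snippet; floordiv truncates, which is why the expected counts come out as integers):

import pandas as pd

df = pd.DataFrame({2: [3, 33324], 3: [0, 15833], 5: [6, 58305],
                   10: [5, 54402], 30: [0, 38920]})

def outer_product(row):
    numerator = df.sum(1).mul(row.sum(0))
    denominator = df.sum(0).sum(0)
    return numerator.floordiv(denominator)

expected_counts = df.apply(outer_product)  # expected cell counts, same shape as df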
Timings: For 1 million rows of DF.