PySpark equivalent of pandas all function - python

I have a spark dataframe df:
A     B      C     D
True  True   True  True
True  False  True  True
True  None   True  None
True  NaN    NaN   False
True  NaN    True  True
Is there a way in PySpark to get a fifth column based on columns A, B, C, D not containing the value False, returning an int value of 1 for True and 0 for False? Hence:
A     B      C     D      E
True  True   True  True   1
True  False  True  True   0
True  None   True  None   1
True  NaN    NaN   False  0
True  NaN    True  True   1
This can be achieved in a pandas dataframe with df.all(axis=1).astype(int).
Any help with a PySpark equivalent would be much appreciated.

I don't have anything to test, but try the code below:

from pyspark.sql import functions as F

# greatest/least skip nulls, so a row qualifies when its smallest
# non-null value is True (i.e. no column is literally False)
df2 = df.withColumn(
    'E',
    (
        (F.greatest(*df.columns) == F.least(*df.columns)) &
        (F.least(*df.columns) == F.lit(True))
    ).cast('int')
)
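Also untested, but an equivalent sketch that makes the null handling explicit is to AND all the columns together, coalescing null to True (since the expected output ignores None/NaN):

from functools import reduce
from pyspark.sql import functions as F

# treat null as True so that only a literal False knocks a row out
df2 = df.withColumn(
    'E',
    reduce(
        lambda a, b: a & b,
        [F.coalesce(F.col(c), F.lit(True)) for c in df.columns]
    ).cast('int')
)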

Related

How to find the count of same values in a row in a dataframe?

The dataframe is as follows:
a     | b     | c     | d
-------------------------------
TRUE    FALSE   TRUE    TRUE
FALSE   FALSE   FALSE   TRUE
TRUE    TRUE    TRUE    TRUE
TRUE    FALSE   TRUE    FALSE
I need to find the count of the TRUEs in each row.
The last column should contain the count, as follows:
a     | b     | c     | d     | count
---------------------------------------
TRUE    FALSE   TRUE    TRUE    3
FALSE   FALSE   FALSE   TRUE    1
TRUE    TRUE    TRUE    TRUE    4
TRUE    FALSE   TRUE    FALSE   2
The logic I tried is:
df.groupby(df.columns.tolist(), as_index=False).size()
But it doesn't work as expected.
Could anyone please help me out here?
Thank you.
Because True values are processed like 1, you can use sum:
df['count'] = df.sum(axis=1)
If the TRUE values are strings:
df['count'] = df.eq('TRUE').sum(axis=1)
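A minimal runnable version with the question's data, assuming real booleans rather than strings:

import pandas as pd

df = pd.DataFrame({'a': [True, False, True, True],
                   'b': [False, False, True, False],
                   'c': [True, False, True, True],
                   'd': [True, True, True, False]})
df['count'] = df.sum(axis=1)  # booleans sum as 1/0, giving 3, 1, 4, 2
print(df)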

Filtering Boolean in Dataframe

I have a df with ±100k rows and 10 columns.
I would like to find/filter the rows that contain between 2 and 4 True values.
For simplicity's sake, let's say I have this df:
A  B      C      D      E      F
1  True   True   False  False  True
2  False  True   True   True   False
3  False  False  False  False  False
4  True   False  False  False  True
5  True   False  False  False  False
Expected output:
A  B      C      D      E      F
1  True   True   False  False  True
2  False  True   True   True   False
4  True   False  False  False  True
I have tried using
df[(df['B']==True) | (df['C']==True) | (df['D']==True) | (df['E']==True) | (df['F']==True)]
but this only eliminates all-False rows and doesn't work if I want to find rows with at least 2 or 3 True values.
Can anyone please help? Appreciate it.
Use DataFrame.select_dtypes to keep only the boolean columns, count the True values with sum, and then filter by Series.between in boolean indexing:
df = df[df.select_dtypes(bool).sum(axis=1).between(2, 4)]
print(df)
   A      B      C      D      E      F
0  1   True   True  False  False   True
1  2  False   True   True   True  False
3  4   True  False  False  False   True
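For reference, a self-contained version; the integer A column is excluded by select_dtypes(bool), so it never skews the count:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [True, False, False, True, True],
                   'C': [True, True, False, False, False],
                   'D': [False, True, False, False, False],
                   'E': [False, True, False, False, False],
                   'F': [True, False, False, True, False]})
df = df[df.select_dtypes(bool).sum(axis=1).between(2, 4)]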

how to check occurrence of a string across two or more columns for each row and assign the final column

id.  datcol1  datacol2  datacol-n  final col (to be created in output)
1    false    true      true       0
2    false    false     false      2
3    true     true      true       0
4    true     false     false      1
There are multiple columns, say 13.
The job is to take each row across all the columns and check:
if the columns contain at least two "true" strings, assign 0; if exactly one "true" string, assign 1; if no "true" at all, assign 2.
Considering df to be:
In [1542]: df
Out[1542]:
   id.  datcol1  datacol2  datacol-n
0    1    False      True       True
1    2    False     False      False
2    3     True      True       True
3    4     True     False      False
Use numpy.select, df.filter, Series.ge and df.sum:
In [1546]: import numpy as np
In [1547]: x = df.filter(like='dat').sum(1)
In [1548]: conds = [x.ge(2), x.eq(1), x.eq(0)]
In [1549]: choices = [0, 1, 2]
In [1553]: df['flag'] = np.select(conds, choices)
In [1554]: df
Out[1554]:
   id.  datcol1  datacol2  datacol-n  flag
0    1    False      True       True     0
1    2    False     False      False     2
2    3     True      True       True     0
3    4     True     False      False     1
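Since the three cases map one-to-one onto the True count clipped at 2, the same flag can also be computed arithmetically (a minor variant, not part of the original answer):

df['flag'] = 2 - df.filter(like='dat').sum(axis=1).clip(upper=2)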

keep rows and the previous one when they are not equal

I have this kind of dataframe:
import pandas as pd

d = {1: [False, False, False, False, True],
     2: [False, True, True, True, False],
     3: [True, False, False, False, True]}
df = pd.DataFrame(d)
df:
       1      2      3
0  False  False   True
1  False   True  False
2  False   True  False
3  False   True  False
4   True  False   True
My goal is to keep row n+1 and row n wherever they are different. In the example df, the result would be:
df_result:
       1      2      3
0  False  False   True
1  False   True  False
3  False   True  False
4   True  False   True
I have already tried df_result = df.ne(df.shift()) and kept only the rows where there is at least one True, but that doesn't keep row 3.
Any idea how I can get the expected result?
Thanks!
I believe you need to compare for inequality with DataFrame.ne, shifting by 1 and by -1, reduce each comparison to at least one mismatch per row with DataFrame.any, and chain the two masks with | for bitwise OR:
df_result = df[df.ne(df.shift()).any(axis=1) | df.ne(df.shift(-1)).any(axis=1)]
print(df_result)
       1      2      3
0  False  False   True
1  False   True  False
3  False   True  False
4   True  False   True
Another similar idea:
df_result = df[(df.ne(df.shift()) | df.ne(df.shift(-1))).any(axis=1)]
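For intuition, here is a runnable version showing the two masks separately; rows 0 and 4 are always kept because comparing against the all-NaN row produced by shift never counts as equal:

import pandas as pd

df = pd.DataFrame({1: [False, False, False, False, True],
                   2: [False, True, True, True, False],
                   3: [True, False, False, False, True]})

prev_diff = df.ne(df.shift()).any(axis=1)    # differs from the previous row
next_diff = df.ne(df.shift(-1)).any(axis=1)  # differs from the next row
print(df[prev_diff | next_diff])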

How to select column names in python pandas by dataframe values?

I have the following dataframe:
pandas.DataFrame(numpy.random.randn(10, 5) > 1, index=range(1, 11), columns=list('ABCDE'))
        A      B      C      D      E
1   False  False  False  False  False
2   False  False  False  False  False
3    True   True  False   True  False
4   False  False   True   True  False
5   False  False  False  False  False
6   False  False  False  False  False
7   False  False  False  False  False
8   False  False  False  False  False
9   False  False  False  False  False
10  False   True  False   True  False
For each row I would like to get the name of the last column in that row containing True.
If there isn't any, return some reasonable placeholder value.
How can I do that?
A one-liner:
>>> value = np.nan
>>> (df[df.columns[::-1]]            # reverse the column order
...    .idxmax(axis=1)               # find the last (now first) True value
...    .reset_index()                # expose the index for the next step
...    .apply(lambda x: value if (x[0] == df.columns[-1]
...                               and not df.loc[x['index'], x[0]])
...           else x[0], axis=1))    # = value if col == "E" and the cell is False
Out[1]:
0    NaN
1    NaN
2      D
3      D
4    NaN
5    NaN
6    NaN
7    NaN
8    NaN
9      D
Explanation:
idxmax returns the index of the max value in a row; if there is more than one max, it returns the first one. We want the last one, so we reverse the column order first.
Finally we must replace the obtained value with `value` when the winning column is "E" and the cell is False (i.e. the row has no True at all). You can't apply a condition on the index of a Series, that's why you need the reset_index first.
This last step could be done more elegantly with df.replace({'E': {False: value}}), which replaces False in column 'E' with value, but somehow it doesn't work for me.
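A shorter alternative on the same data is to mask the False cells and take each row's last remaining column label (a sketch, not from the original answers):

# df.where(df) turns every False into NaN, so last_valid_index gives the
# label of the last column that held True; all-False rows return None
last_true = df.where(df).apply(lambda row: row.last_valid_index(), axis=1)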
Set up the example data first:
np.random.seed(1)
df = pd.DataFrame(np.random.randn(10, 5) > 1, index=range(1, 11), columns=['A', 'B', 'C', 'D', 'E'])
df
looks like:
        A      B      C      D      E
1    True  False  False  False  False
2   False   True  False  False  False
3    True  False  False  False   True
4   False  False  False  False  False
5   False   True  False  False  False
6   False  False  False  False  False
7   False  False  False  False  False
8   False  False  False   True  False
9   False  False  False   True  False
10  False  False   True  False  False
It sounds like what you want to do is get the index of each True value and then select the max index. On a single column that might look like the following:
df['A'][df['A']].index.max()
which returns 3. To do this for all the columns, the easiest way is to iterate through each column and push the result into a list:
mylist = []
for col in df.columns:
    myval = df[col][df[col]].index.max()
    mylist.append(myval)
mylist
that returns:
[3, 5, 10, 9, 3]
The loop logic above returns NaN if there is no True value in the column.
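The same loop can be written as a dict comprehension if keeping the column labels alongside the positions is useful (a stylistic variant):

last_true_per_column = {col: df[col][df[col]].index.max() for col in df.columns}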
