I have a large DataFrame with 500 columns, of which 300 columns (col1, col2, ..., col300) appear as follows:
idx col1 col2
a -1 4
b 2 1
c -1 -1
I want to get the following for the 300 columns; the other 200 columns are variables I am not interested in:
idx col1 col2 numPos
a -1 4 1
b 2 1 2
c -1 -1 0
where numPos is the number of positive values in each row. I don't want to use the apply method, as there are about 2 million rows in the DataFrame. Is there a Pythonic way to do this?
You could select the columns and use gt (which creates a boolean DataFrame that is True where a value is positive), then sum along the row axis:
df['numPos'] = df[['col1','col2']].gt(0).sum(axis=1)
You could also select them with filter, e.g. if their names all contain 'col':
df['numPos'] = df.filter(like='col').gt(0).sum(axis=1)
Output:
idx col1 col2 numPos
0 a -1 4 1
1 b 2 1 2
2 c -1 -1 0
Another option is to exclude the object columns, test which values are greater than 0, and sum along the row axis:
df['numPos'] = df.select_dtypes(exclude='object').gt(0).sum(axis=1)
idx col1 col2 numPos
0 a -1 4 1
1 b 2 1 2
2 c -1 -1 0
df['numPos'] = (df[cols] > 0).sum(axis=1)
where cols is a list of the column names. If the 300 columns are contiguous, then in place of df[cols] you can use df.iloc[:, start_offset:start_offset+300], where start_offset is the positional index of the first of those columns.
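A minimal sketch on the three-row frame from the question, assuming the positive-value columns start at position 1; both the name-based and the positional selection give the same result:
import pandas as pd

df = pd.DataFrame({'idx': ['a', 'b', 'c'],
                   'col1': [-1, 2, -1],
                   'col2': [4, 1, -1]})

# explicit list of column names
cols = ['col1', 'col2']
df['numPos'] = (df[cols] > 0).sum(axis=1)

# equivalent positional slice if the columns of interest are contiguous
start_offset = 1
df['numPos'] = (df.iloc[:, start_offset:start_offset + 2] > 0).sum(axis=1)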
How would I go through a DataFrame and add a new column whose values depend on whether the existing columns all contain the same value in each row?
For example, in the following DataFrame I want to add a new column that contains 1 in the rows where Col1 and Col2 both contain 1, and 0 otherwise.
Col1 Col2
1 1
1 1
1 0
0 0
0 1
1 1
1 1
The output that I would want is
Col1 Col2 Col3
1 1 1
1 1 1
1 0 0
0 0 0
0 1 0
1 1 1
1 1 1
Ideally this would scale to more columns in the future (the new column would only contain 1 if all columns contain 1).
If there are only 0 and 1 values, you can try Series.mul:
df['Col3'] = df['Col1'].mul(df['Col2'])
If you need to check whether all columns are 1, use DataFrame.all and cast to integer; this works if the data contain only 1 and 0:
df['col3'] = df.all(axis=1).astype(int)
If you need to test specifically for 1, which works for any data, use DataFrame.eq (i.e. ==):
df['col3'] = df.eq(1).all(axis=1).astype(int)
If you want to check only selected columns, pass a subset:
cols = ['Col1', 'Col2']
df['col3'] = df[cols].all(axis=1).astype(int)
Or:
df['col3'] = df[cols].eq(1).all(axis=1).astype(int)
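As a quick check, here is a minimal runnable sketch on the sample data from the question (using Col3 as the new column name to match the desired output):
import pandas as pd

df = pd.DataFrame({'Col1': [1, 1, 1, 0, 0, 1, 1],
                   'Col2': [1, 1, 0, 0, 1, 1, 1]})

cols = ['Col1', 'Col2']
df['Col3'] = df[cols].eq(1).all(axis=1).astype(int)
print(df)
#    Col1  Col2  Col3
# 0     1     1     1
# 1     1     1     1
# 2     1     0     0
# 3     0     0     0
# 4     0     1     0
# 5     1     1     1
# 6     1     1     1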
I have a DataFrame whose columns represent actual entities. Apart from the first column, which is a key, the column values are either 1 or 0. The objective is to return the names of the columns (2nd to last) whose value is 1.
This is the function that I have written, and it works. I was wondering if there is a better way to express this in pandas, or even a better way to represent this form of data to make it more pandas-friendly.
def return_keys(df, productname):
    df2 = df[df['Product'] == productname]
    print(df2)
    columns = list(df2)
    cust = []
    for col in columns[1:]:
        if df2[col].to_list()[0] == 1:
            cust.append(col)
    return cust
If your key column does not contain 0/1, you can try using apply row-wise. Below is an example dataset:
import pandas as pd
import numpy as np
np.random.seed(111)
df = pd.DataFrame({'Product': np.random.choice(['A', 'B', 'C'], 10),
                   'Col1': np.random.binomial(1, 0.5, 10),
                   'Col2': np.random.binomial(1, 0.5, 10),
                   'Col3': np.random.binomial(1, 0.5, 10)})
df
Product Col1 Col2 Col3
0 A 0 1 1
1 A 1 0 0
2 A 1 1 1
3 A 1 0 0
4 C 1 1 1
5 B 0 1 1
6 C 1 0 0
7 C 0 1 0
8 C 1 1 1
9 A 0 1 0
We compare the frame to 1 to get a boolean DataFrame, then apply row-wise (axis=1) over it to pick out the column names where the value is True:
(df == 1).apply(lambda x: df.columns[x].tolist(), axis=1)
0 [Col2, Col3]
1 [Col1]
2 [Col1, Col2, Col3]
3 [Col1]
4 [Col1, Col2, Col3]
5 [Col2, Col3]
6 [Col1]
7 [Col2]
8 [Col1, Col2, Col3]
9 [Col2]
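If you only need the matching columns for a single product, as in the original return_keys function, a minimal vectorized sketch (assuming at least one row matches the product name) could be:
def return_keys(df, productname):
    # take the first matching row, drop the key column, keep columns equal to 1
    row = df.loc[df['Product'] == productname].iloc[0, 1:]
    return row.index[row == 1].tolist()

return_keys(df, 'B')  # ['Col2', 'Col3'] for product 'B' in the table above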
Try the following:
df = pd.DataFrame({'a':[1,2,3], 'b':[0,1,2], 'c':[1,2,4], 'd':[0,2,4]})
cols_with_first_element_1 = df.columns[df.iloc[0]==1].to_list()
print(cols_with_first_element_1)
results in ['a', 'c'].
I have a pandas DataFrame with several columns that have the same names. I would like to combine those columns into one and sum their values. For example, foot comes up 5 times as a column name; I would like to combine those 5 into one foot column containing the sum of their values (1 in the first row).
I would like the combined DataFrame to become:
finger foot forearm glute groin
0 1 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
Essentially, the columns named finger are combined into one column with header finger, and the sum of all the items in that row is 0. Similarly, the columns named foot are combined into one column called foot whose value in that particular row is the sum, 1. I would like to do this for all the columns and get the sum of every item with the same column name.
How could I do this?
Use DataFrame.groupby
Here is an example
df=pd.DataFrame({'col1':[1,2],'col2':[2,3]})
df=pd.concat([df,df],axis=1)
print(df)
col1 col2 col1 col2
0 1 2 1 2
1 2 3 2 3
new_df=df.groupby(level=0,axis=1).sum()
print(new_df)
col1 col2
0 2 4
1 4 6
axis=1 tells pandas that we want to group by columns, and level=0 tells it to form the groups from level 0 of the column labels (because axis=1). In this case there is only one level, since the columns are not a MultiIndex. Another way to do this would be:
new_df = df.groupby(df.columns, axis=1).sum()
col1 col2
0 2 4
1 4 6
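Note that passing axis=1 to groupby is deprecated in recent pandas releases (it emits a FutureWarning in 2.x, as far as I know); an equivalent sketch that groups the transposed frame avoids this:
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [2, 3]})
df = pd.concat([df, df], axis=1)

# group the duplicated column names on the transposed frame, then transpose back
new_df = df.T.groupby(level=0).sum().T
print(new_df)
#    col1  col2
# 0     2     4
# 1     4     6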
From a simple DataFrame like this in PySpark:
col1 col2 count
A 1 4
A 2 8
A 3 2
B 1 3
C 1 6
I would like to add rows so that each value of col1 is paired with each value of col2, with the count column filled with 0 for the combinations that don't exist in the original data. It would look like this:
col1 col2 count
A 1 4
A 2 8
A 3 2
B 1 3
B 2 0
B 3 0
C 1 6
C 2 0
C 3 0
Do you have any idea how to do that efficiently?
You're looking for crossJoin.
from pyspark.sql.functions import col

data = df.select('col1', 'col2')
# this one gives you all combinations of col1 + col2
all_combinations = data.alias('a').crossJoin(data.alias('b')).select('a.col1', 'b.col2').distinct()
# this one appends the count column from the original dataset, with null for all other combinations
result = all_combinations.alias('a').join(
    df.alias('b'),
    on=(col('a.col1') == col('b.col1')) & (col('a.col2') == col('b.col2')),
    how='left').select('a.*', 'b.count')
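If you want 0 instead of null for the missing combinations, as in the desired output, a fillna on the joined result should do it (a sketch assuming the joined DataFrame is named result as above):
result = result.fillna(0, subset=['count'])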
I have a pandas DataFrame from which I'm trying to drop rows based on a criterion across selected columns. If the values in these selected columns are all zero, the row should be dropped. Here is an example.
import pandas as pd
t = pd.DataFrame({'a':[1,0,0,2],'b':[1,2,0,0],'c':[1,2,3,4]})
a b c
0 1 1 1
1 0 2 2
2 0 0 3
3 2 0 4
I would like to try something like:
cols_of_interest = ['a','b'] #Drop rows if zero in all these columns
t = t[t[cols_of_interest]!=0]
This doesn't drop the rows, so I tried:
t = t.drop(t[t[cols_of_interest]==0].index)
And all rows are dropped.
What I would like to end up with is:
a b c
0 1 1 1
1 0 2 2
3 2 0 4
Where the 3rd row (index 2) was dropped because it took on value 0 in BOTH the columns of interest, not just one.
Your problem here is that you first assigned the result of your boolean condition, t = t[t[cols_of_interest] != 0], which overwrites your original df and sets the values where the condition is not met to NaN.
What you want to do is generate the boolean mask, then drop the NaN rows, passing thresh=1 so that at least one non-NaN value must remain in the row. We can then use loc with the resulting index to get the desired df:
In [124]:
cols_of_interest = ['a','b']
t.loc[t[t[cols_of_interest]!=0].dropna(thresh=1).index]
Out[124]:
a b c
0 1 1 1
1 0 2 2
3 2 0 4
EDIT
As pointed out by @DSM, you can achieve this simply by using any with axis=1 to test the condition, and use the result to index into your df:
In [125]:
t[(t[cols_of_interest] != 0).any(axis=1)]
Out[125]:
a b c
0 1 1 1
1 0 2 2
3 2 0 4
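Equivalently (by De Morgan's law), you can phrase the mask as "keep rows that are not all zero in the columns of interest", which reads closer to the original wording of the question:
t[~(t[cols_of_interest] == 0).all(axis=1)]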