How would I go through a DataFrame and add a new column whose values depend on whether the existing columns hold the same values in each row?
For example, in the following DataFrame I want to add a new column that contains 1 in the rows where Col1 and Col2 both contain 1, and 0 where they do not.
Col1 Col2
1 1
1 1
1 0
0 0
0 1
1 1
1 1
The output that I would want is
Col1 Col2 Col3
1 1 1
1 1 1
1 0 0
0 0 0
0 1 0
1 1 1
1 1 1
Ideally this would be scalable to more columns in the future (the new column would only contain 1 if all columns contain 1).
If the columns contain only 0 and 1, you can multiply them with Series.mul:
df['Col3'] = df['Col1'].mul(df['Col2'])
If you need to check whether all columns are 1, use DataFrame.all and cast to integer. This works only when the data are just 1s and 0s:
df['col3'] = df.all(axis=1).astype(int)
If you need to test specifically for 1 (this works for any data), use DataFrame.eq for ==:
df['col3'] = df.eq(1).all(axis=1).astype(int)
If you want to restrict the check to particular columns, pass a subset:
cols = ['Col1', 'Col2']
df['col3'] = df[cols].all(axis=1).astype(int)
Or:
df['col3'] = df[cols].eq(1).all(axis=1).astype(int)
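For reference, a minimal, self-contained run of the subset variant on the sample data from the question:
import pandas as pd

df = pd.DataFrame({'Col1': [1, 1, 1, 0, 0, 1, 1],
                   'Col2': [1, 1, 0, 0, 1, 1, 1]})

# 1 only where every column in cols equals 1; extend cols for more columns
cols = ['Col1', 'Col2']
df['Col3'] = df[cols].eq(1).all(axis=1).astype(int)
print(df)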
I have the following DataFrame with some numbers in it, where the sum of the values in Col1, Col2, and Col3 equals the value in the Main column.
How can I replace the values in the Col columns where they are equal to the corresponding value in the Main column?
For example, the following DataFrame:
Main Col1 Col2 Col3
0 100 50 50 0
1 200 0 200 0
2 30 20 5 5
3 500 0 0 500
would be changed to this:
Main Col1 Col2 Col3
0 100 50 50 0
1 200 0 EQUAL 0
2 30 20 5 5
3 500 0 0 EQUAL
You can use filter to operate only on the "Col" columns (you could also use slicing with a list; see the alternative below), then mask to change the matching values, and finally update to modify the DataFrame in place:
df.update(df.filter(like='Col').mask(df.eq(df['Main'], axis=0), 'EQUAL'))
Alternative:
cols = ['Col1', 'Col2', 'Col3']
df.update(df[cols].mask(df.eq(df['Main'], axis=0), 'EQUAL'))
Output:
Main Col1 Col2 Col3
0 100 50 50 0
1 200 0 EQUAL 0
2 30 20 5 5
3 500 0 0 EQUAL
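For reference, here is a minimal, self-contained sketch reproducing the example above (the sample data is rebuilt from the question):
import pandas as pd

df = pd.DataFrame({'Main': [100, 200, 30, 500],
                   'Col1': [50, 0, 20, 0],
                   'Col2': [50, 200, 5, 0],
                   'Col3': [0, 0, 5, 500]})

# mask() swaps in 'EQUAL' wherever the row-wise comparison is True;
# update() then writes those values back into df, aligning on labels
df.update(df.filter(like='Col').mask(df.eq(df['Main'], axis=0), 'EQUAL'))
print(df)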
There are several different ways of doing this; I suggest using the np.where() function.
import numpy as np
df['Col1'] = np.where(df['Col1'] == df['Main'], 'EQUAL', df['Col1'])
df['Col2'] = np.where(df['Col2'] == df['Main'], 'EQUAL', df['Col2'])
df['Col3'] = np.where(df['Col3'] == df['Main'], 'EQUAL', df['Col3'])
Read more about np.where() in the NumPy documentation.
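To keep this scalable, the per-column calls can be folded into a loop (a sketch using the column names from the question):
import numpy as np

for col in ['Col1', 'Col2', 'Col3']:
    df[col] = np.where(df[col] == df['Main'], 'EQUAL', df[col])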
I have a large DataFrame with 500 columns, of which 300 (col1, col2, ..., col300) appear as follows:
idx col1 col2
a -1 4
b 2 1
c -1 -1
I want to get the following for the 300 columns (the other 200 columns are variables I am not interested in):
idx col1 col2 numPos
a -1 4 1
b 2 1 2
c -1 -1 0
where for each row I want the number of positive values. I don't want to use the apply method, as there are about 2 million rows in the DataFrame. Is there a pythonic way to do this?
You could select the columns, use gt (which creates a boolean DataFrame that is True where a value is positive), then sum along the row axis:
df['numPos'] = df[['col1','col2']].gt(0).sum(axis=1)
You could also select the columns with filter:
df['numPos'] = df.filter(like='col').gt(0).sum(axis=1)
Output:
idx col1 col2 numPos
0 a -1 4 1
1 b 2 1 2
2 c -1 -1 0
Another option is to exclude the object columns, test which values in the DataFrame are greater than 0, and sum along the row axis:
df['numPos'] = df.select_dtypes(exclude='object').gt(0).sum(axis=1)
  idx  col1  col2  numPos
0   a    -1     4       1
1   b     2     1       2
2   c    -1    -1       0
df['numPos'] = (df[cols] > 0).sum(axis=1)
where cols is a list of column names. If the 300 columns are consecutive, then in place of df[cols] you can use df.iloc[:,start_offset:start_offset+300] where start_offset is the index of the first column.
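For example, assuming the 300 columns really are named col1 through col300 (as in the question), the list can be generated rather than typed out:
cols = [f'col{i}' for i in range(1, 301)]
df['numPos'] = (df[cols] > 0).sum(axis=1)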
I have rows and columns, with the columns representing actual entities. Apart from the first column, the column values are either 1 or 0. The first column is a key. The objective is to return the column names (2nd to last column) where the column value is 1.
This is the function that I have written, and it works. I was wondering whether there is a better way to express this in pandas, or even a better way to represent this form of data to make it more pandas-friendly.
def return_keys(df, productname):
    # keep only the rows matching the given product key
    df2 = df[df['Product'] == productname]
    print(df2)
    columns = list(df2)
    cust = []
    # collect the non-key columns whose value in the first matching row is 1
    for col in columns[1:]:
        if df2[col].to_list()[0] == 1:
            cust.append(col)
    return cust
If your key column does not contain 0/1, you can try using apply row-wise. Below is an example dataset:
import pandas as pd
import numpy as np
np.random.seed(111)
df = pd.DataFrame({'Product': np.random.choice(['A','B','C'], 10),
                   'Col1': np.random.binomial(1, 0.5, 10),
                   'Col2': np.random.binomial(1, 0.5, 10),
                   'Col3': np.random.binomial(1, 0.5, 10)})
df
Product Col1 Col2 Col3
0 A 0 1 1
1 A 1 0 0
2 A 1 1 1
3 A 1 0 0
4 C 1 1 1
5 B 0 1 1
6 C 1 0 0
7 C 0 1 0
8 C 1 1 1
9 A 0 1 0
We build a boolean DataFrame with df == 1, then apply (axis=1) over it to pull out the matching column names for each row:
(df == 1).apply(lambda x: df.columns[x].tolist(), axis=1)
0 [Col2, Col3]
1 [Col1]
2 [Col1, Col2, Col3]
3 [Col1]
4 [Col1, Col2, Col3]
5 [Col2, Col3]
6 [Col1]
7 [Col2]
8 [Col1, Col2, Col3]
9 [Col2]
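Applied to the original question, the same boolean-indexing idea can replace the loop in return_keys (a sketch; like the original function, it looks only at the first row matching the product):
def return_keys(df, productname):
    # first row for this product, as a Series indexed by column name
    row = df[df['Product'] == productname].iloc[0]
    # keep the names of the columns whose value is 1
    return row.index[row == 1].tolist()

print(return_keys(df, 'B'))  # ['Col2', 'Col3'] for the seeded example above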
Try the following:
df = pd.DataFrame({'a':[1,2,3], 'b':[0,1,2], 'c':[1,2,4], 'd':[0,2,4]})
cols_with_first_element_1 = df.columns[df.iloc[0]==1].to_list()
print(cols_with_first_element_1)
results in ['a', 'c'].
I have a data frame with true/false values stored in string format. Some values are null in the data frame.
I need to encode this data such that TRUE/FALSE/null values are encoded with the same integer in every column.
Input:
col1 col2 col3
True True False
True True True
null null True
I am using:
le = preprocessing.LabelEncoder()
df.apply(le.fit_transform)
Output:
2 1 0
2 1 1
1 0 1
But I want the output as:
2 2 0
2 2 2
1 1 2
How do I do this?
One way that works is to reshape into a one-column DataFrame, so a single encoder sees every value, then unstack back:
df = df.stack(dropna=False).to_frame().apply(le.fit_transform)[0].unstack()
print (df)
col1 col2 col3
0 1 1 0
1 1 1 1
2 2 2 1
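A self-contained version of the same idea (assuming sklearn's LabelEncoder from the question, and that the nulls are the literal string 'null' as in the sample, since LabelEncoder cannot mix NaN with strings):
import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame({'col1': ['True', 'True', 'null'],
                   'col2': ['True', 'True', 'null'],
                   'col3': ['False', 'True', 'True']})

le = preprocessing.LabelEncoder()

# stack all columns into one Series so one encoder sees every value,
# encode, then unstack back to the original shape
df = df.stack().to_frame().apply(le.fit_transform)[0].unstack()
print(df)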
Another idea is to use DataFrame.replace with the string 'True' instead of the boolean True, because the question states: "I have a data frame with true/false values stored in string format."
If the nulls are missing values (NaN):
df = df.replace({'True':2, 'False':1, np.nan:0})
If the nulls are the literal string 'null':
df = df.replace({'True':2, 'False':1, 'null':0})
print (df)
col1 col2 col3
0 2 2 1
1 2 2 2
2 0 0 2
I have a pandas DataFrame in which several columns share the same name. I would like to combine those duplicate columns into one and sum their values. For example, in the first row, foot comes up 5 times as a column name; I would like to combine those 5 into one foot column containing the sum of the values (1).
For the DataFrame above, I would like this to be combined to become:
finger foot forearm glute groin
0 1 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
Essentially, the 5 columns named finger are combined into one column with header finger, and the sum of all the items in that row is 0. Similarly, the 6 columns named foot are combined into one foot column, and the sum of those 6 columns in that particular row is 1. I would like to do this for all the columns, summing every item that shares a column name.
How could I do this?
Use DataFrame.groupby
Here is an example
df=pd.DataFrame({'col1':[1,2],'col2':[2,3]})
df=pd.concat([df,df],axis=1)
print(df)
col1 col2 col1 col2
0 1 2 1 2
1 2 3 2 3
new_df=df.groupby(level=0,axis=1).sum()
print(new_df)
col1 col2
0 2 4
1 4 6
axis=1 tells pandas that we want to form groups by columns, and level=0 tells it to split the groups based on level 0 of the column index (because axis=1); here the columns have only one level because there is no MultiIndex. Another way to do this would be:
new_df = df.groupby(df.columns, axis=1).sum()
col1 col2
0 2 4
1 4 6
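Note that grouping along axis=1 is deprecated in recent pandas releases; an equivalent that avoids it is to transpose, group the duplicate labels on the index, and transpose back:
# rows of df.T carry the duplicate column labels, so a plain groupby works
new_df = df.T.groupby(level=0).sum().T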