Drop Rows by Multiple Column Criteria in DataFrame - python

I have a pandas dataframe from which I'm trying to drop rows based on a criterion across selected columns. If the values in all of these selected columns are zero, the row should be dropped. Here is an example.
import pandas as pd
t = pd.DataFrame({'a':[1,0,0,2],'b':[1,2,0,0],'c':[1,2,3,4]})
a b c
0 1 1 1
1 0 2 2
2 0 0 3
3 2 0 4
I would like to try something like:
cols_of_interest = ['a','b'] #Drop rows if zero in all these columns
t = t[t[cols_of_interest]!=0]
This doesn't drop the rows, so I tried:
t = t.drop(t[t[cols_of_interest]==0].index)
And all rows are dropped.
What I would like to end up with is:
a b c
0 1 1 1
1 0 2 2
3 2 0 4
Where the 3rd row (index 2) was dropped because it took on value 0 in BOTH the columns of interest, not just one.

Your problem here is that you assigned the result of your boolean condition back to t: t = t[t[cols_of_interest]!=0] overwrites your original df, and indexing with a boolean DataFrame doesn't drop rows; it just sets every position where the condition is not met to NaN.
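For illustration, here is what that selection actually returns when run against the original example frame; note that column c, which isn't covered by the mask at all, comes back entirely NaN:
print(t[t[cols_of_interest] != 0])
     a    b   c
0  1.0  1.0 NaN
1  NaN  2.0 NaN
2  NaN  NaN NaN
3  2.0  NaN NaN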
What you want to do is generate the boolean mask, drop the all-NaN rows by passing thresh=1 (so a row must have at least one non-NaN value to survive), and then use loc with the index of the result to get the desired df:
In [124]:
cols_of_interest = ['a','b']
t.loc[t[t[cols_of_interest]!=0].dropna(thresh=1).index]
Out[124]:
a b c
0 1 1 1
1 0 2 2
3 2 0 4
EDIT
As pointed out by @DSM, you can achieve this more simply by using any with axis=1 to test the condition row-wise, and using the result to index into your df:
In [125]:
t[(t[cols_of_interest] != 0).any(axis=1)]
Out[125]:
a b c
0 1 1 1
1 0 2 2
3 2 0 4
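For the record, the intermediate mask reduces to one boolean per row:
(t[cols_of_interest] != 0).any(axis=1)
0     True
1     True
2    False
3     True
dtype: bool
Conversely, if you wanted to drop rows that have a zero in any of the columns of interest, you would use .all(axis=1) instead.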

Related

pandas: replace values in column with the last character in the column name

I have a dataframe as follows:
import pandas as pd
df = pd.DataFrame({'sent.1':[0,1,0,1],
'sent.2':[0,1,1,0],
'sent.3':[0,0,0,1],
'sent.4':[1,1,0,1]
})
I am trying to replace the non-zero values with the character at index 5 of the column name (which is its numeric part), so the output should be:
sent.1 sent.2 sent.3 sent.4
0 0 0 0 4
1 1 2 0 4
2 0 2 0 0
3 1 0 3 4
I have tried the following, but it does not work:
print(df.replace(1, pd.Series([i[5] for i in df.columns], [i[5] for i in df.columns])))
However, when I replace the values with the column names themselves, as below, the code works, so I am not sure which part is wrong.
print(df.replace(1, pd.Series(df.columns, df.columns)))
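For what it's worth, the likely culprit in the first attempt is the Series index: when value is a Series, replace looks up each column's replacement by that index, and [i[5] for i in df.columns] builds the index '1', '2', '3', '4' rather than the column names (which is exactly why the column-name version works). Keeping the column names as the index should fix it; a sketch (note the replacements come out as strings here):
print(df.replace(1, pd.Series([c[5] for c in df.columns], index=df.columns)))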
Since you're dealing with 1's and 0's, you can actually just multiply the dataframe by a range:
df = df * range(1, df.shape[1] + 1)
Output:
sent.1 sent.2 sent.3 sent.4
0 0 0 0 4
1 1 2 0 4
2 0 2 0 0
3 1 0 3 4
Or, if you want to take the numbers from the column names:
df = df * df.columns.str.split('.').str[-1].astype(int)
You could use string multiplication on a boolean mask to place the strings where the condition holds, and where to restore the zeros:
mask = df.ne(0)
(mask*df.columns.str[5]).where(mask, 0)
To have integers:
mask = df.ne(0)
(mask*df.columns.str[5].astype(int))
output:
sent.1 sent.2 sent.3 sent.4
0 0 0 0 4
1 1 2 0 4
2 0 2 0 0
3 1 0 3 4
And another one, working with an arbitrary condition (here s.ne(0)):
df.apply(lambda s: s.mask(s.ne(0), s.name.rpartition('.')[-1]))
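A quick way to see why the .where(mask, 0) step is needed in the string version: multiplying a boolean by a string repeats the string zero or one times, so the False positions come back as empty strings rather than zeros:
print(True * '1')   # '1'
print(False * '1')  # ''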

How to drop conflicted rows in Dataframe?

I have a classification task, and conflicts (rows with the same feature but different labels) harm its performance.
idx feature label
0 a 0
1 a 1
2 b 0
3 c 1
4 a 0
5 b 0
How could I get a formatted dataframe like the one below?
idx feature label
2 b 0
3 c 1
5 b 0
DataFrame.duplicated() only outputs the duplicated rows, and it seems that logical operations between df["features"].duplicated() and df.duplicated() do not return the results I want.
I think you need the rows whose feature group has only one unique label value, so use GroupBy.transform with DataFrameGroupBy.nunique, compare the result to 1, and filter with boolean indexing:
df = df[df.groupby('feature')['label'].transform('nunique').eq(1)]
print (df)
idx feature label
2 2 b 0
3 3 c 1
5 5 b 0
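Details: the transform step counts how many distinct labels each feature maps to, and only rows where that count equals 1 (here rows 2, 3 and 5) survive the boolean indexing:
print (df.groupby('feature')['label'].transform('nunique'))
0    2
1    2
2    1
3    1
4    2
5    1
Name: label, dtype: int64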

Scan subset of PD DataFrame to obtain indices matching certain values

I have a dataframe. Some of the columns should contain only 0s or 1s. I need to find the rows in which any of those columns holds a value other than 0 or 1, and remove those rows from the original dataset.
I have created a second dataframe consisting of just the columns that must be checked, but after finding the indices and dropping them from the original dataframe, I am not getting the right answer.
#Reading in the data:
data=pd.read_csv('DataSet.csv')
#Creating subset df of the columns that must be only 0 or 1 (which is all rows in columns 2 onwards):
subset = data.iloc[:,2:]
#find indices:
index = subset[ (subset!= 0) & (subset!= 1)].index
#remove rows from orig data set:
data = data.drop(index)
It drops every row, leaving me with an empty dataframe. Please help.
Sample:
data = pd.DataFrame({
'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'D':[1,0,1,0,1,0],
'E':[1,0,0,1,2,4],
})
print (data)
A B D E
0 a 4 1 1
1 b 5 0 0
2 c 4 1 0
3 d 5 0 1
4 e 5 1 2
5 f 4 0 4
If you need to keep only rows containing just 1 and 0 values, use DataFrame.isin with DataFrame.all to test whether all values per row are True:
subset = data.iloc[:,2:]
data3 = data[subset.isin([0,1]).all(axis=1)]
print (data3)
A B D E
0 a 4 1 1
1 b 5 0 0
2 c 4 1 0
3 d 5 0 1
Details:
print (subset.isin([0,1]))
D E
0 True True
1 True True
2 True True
3 True True
4 True False
5 True False
print (subset.isin([0,1]).all(axis=1))
0 True
1 True
2 True
3 True
4 False
5 False
dtype: bool
Your subset is a pd.DataFrame, not a pd.Series. The conditional test you are using to build index would work if subset were a Series (i.e. if you were only checking the condition on a single column, not multiple columns).
So having subset as a DataFrame is fine, but it changes how the conditional slice works: it returns NaN wherever the value is 0 or 1, rather than leaving those rows out as a slice of a Series would. Dropping only the all-NaN rows with dropna(how='all') keeps exactly the rows that contain at least one offending value, and their index is what you want to drop. (A plain dropna() would drop any row containing even one NaN, i.e. one valid 0/1 value, which is not what you want here.)
#find indices of rows with at least one value other than 0 or 1:
index = subset[ (subset!= 0) & (subset!= 1)].dropna(how='all').index
#remove rows from orig data set:
data = data.drop(index)
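As a quick check against the sample data from the answer above, this picks up exactly the offending rows:
bad = subset[ (subset!= 0) & (subset!= 1)].dropna(how='all')
print (bad.index)              # rows 4 and 5, the ones containing the 2 and the 4
print (data.drop(bad.index))   # leaves rows 0 through 3, matching data3 above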
From your code I made a calculated guess that you want to check the condition across more than one column.
This should do the trick:
import numpy as np

# True where the element is 0 or 1
val = np.isin(subset, np.array([0, 1]))
# A row qualifies only if every checked element in it is 0 or 1
index = np.prod(val, axis=1) > 0
# Keep only the desired rows
data = data[index]
Example
# Data
a b c
0 1 1 1
1 2 2 2
2 3 1 3
3 4 3 3
4 5 3 1
# Removing rows that have elements other than 1 or 2 (i.e. using np.isin(subset, np.array([1, 2])) here)
a b c
0 1 1 1
1 2 2 2
Without your data from DataSet.csv, I made a calculated guess.
subset[ (subset!= 0) & (subset!= 1)] returns the subset dataframe with the values where (subset!= 0) & (subset!= 1) is False turned into NaN, while the positions where it is True keep their original values. In other words, it is an element-wise mapping, not a filter.
Therefore subset[ (subset!= 0) & (subset!= 1)].index is the whole index of your data dataframe.
You drop all of it, so you get back an empty dataframe.
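A quick check with the sample data above shows what that mapping looks like:
print (subset[ (subset!= 0) & (subset!= 1)])
    D    E
0 NaN  NaN
1 NaN  NaN
2 NaN  NaN
3 NaN  NaN
4 NaN  2.0
5 NaN  4.0
Every row is still present, so .index is the full 0 through 5 range, and dropping it empties the dataframe.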

Creating a pandas column conditional to another columns values

I'm trying to create a class column in a pandas dataframe conditional on another column's values. The value will be 1 if the other column's value at position i+1 is greater than its value at position i, and 0 otherwise.
For example:
column1 column2
5 1
6 0
3 0
2 1
4 0
How do I create column2 by iterating through column1?
You can use the diff method on the first column with a period of -1, then check if it is less than zero to create the second column.
import pandas as pd
df = pd.DataFrame({'c1': [5,6,3,2,4]})
df['c2'] = (df.c1.diff(-1) < 0).astype(int)
df
# returns:
c1 c2
0 5 1
1 6 0
2 3 0
3 2 1
4 4 0
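Printing the intermediate diff makes the edge case visible: the last position is NaN, and since NaN < 0 evaluates to False, the final row correctly gets 0:
df.c1.diff(-1)
# 0   -1.0
# 1    3.0
# 2    1.0
# 3   -2.0
# 4    NaN
# Name: c1, dtype: float64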
You can also use shift. Performance is almost the same as diff, but diff seems to be slightly faster.
df = pd.DataFrame({'column1': [5,6,3,2,4]})
df['column2'] = (df['column1'] <df['column1'].shift(-1)).astype(int)
print(df)
column1 column2
0 5 1
1 6 0
2 3 0
3 2 1
4 4 0
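The same edge case applies here: shift(-1) leaves a NaN in the last slot, and 4 < NaN evaluates to False, so the last row also gets 0:
df['column1'].shift(-1)
# 0    6.0
# 1    3.0
# 2    2.0
# 3    4.0
# 4    NaN
# Name: column1, dtype: float64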

Copy pandas DataFrame row to multiple other rows

Simple and practical question, yet I can't find a solution.
The questions I took a look were the following:
Modifying a subset of rows in a pandas dataframe
Changing certain values in multiple columns of a pandas DataFrame at once
Fastest way to copy columns from one DataFrame to another using pandas?
Selecting with complex criteria from pandas.DataFrame
The key difference between those and mine is that I need to insert not a single value, but a whole row.
My problem is this: I pick a row out of a dataframe, say df1, which gives me a Series.
Now I have another dataframe, df2, in which I have selected multiple rows according to a criterion, and I want to replicate that Series into all of those rows.
df1:
Index/Col A B C
1 0 0 0
2 0 0 0
3 1 2 3
4 0 0 0
df2:
Index/Col A B C
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
What I want to accomplish is inserting df1[3] into rows df2[2] and df2[3], for example. So, something like this non-working code:
series = df1[3]
df2[df2.index>=2 and df2.index<=3] = series
returning
df2:
Index/Col A B C
1 0 0 0
2 1 2 3
3 1 2 3
4 0 0 0
Use loc and pass a list of the index labels of interest; after the comma, the : indicates we want to set all column values. We then assign the series, but access its .values attribute so that it's a numpy array. Otherwise you will get a ValueError from the shape mismatch: you're intending to overwrite 2 rows with a single row, and as a Series it won't align as you desire:
In [76]:
df2.loc[[2,3],:] = df1.loc[3].values
df2
Out[76]:
A B C
1 0 0 0
2 1 2 3
3 1 2 3
4 0 0 0
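If the target rows come from a condition rather than a literal list of labels, the same pattern works; just remember that element-wise boolean logic in pandas uses & rather than and, which is what broke the attempt in the question. A sketch:
df2.loc[(df2.index >= 2) & (df2.index <= 3), :] = df1.loc[3].values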
Suppose you have to copy certain rows and columns from one dataframe to another; this does it:
df2 = df.loc[x:y, a:b]  # x and y are the row bounds and a and b are the column bounds you have to select
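Note that, unlike positional slicing, .loc label slices are inclusive on both ends. For example, with df1 from the question above (using a hypothetical name sub for the result):
sub = df1.loc[2:3, 'A':'B']   # rows 2 and 3, columns A and B; both endpoints are included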
