Scan subset of pandas DataFrame to obtain indices matching certain values - python

I have a dataframe. Some of the columns should contain only 0s or 1s. I need to find the rows that have a value other than 0 or 1 in those columns and remove them from the original dataset.
I have created a second dataframe consisting of the columns that must be checked. After finding the indices and dropping them from the original dataframe, I am not getting the right answer.
# Reading in the data:
data = pd.read_csv('DataSet.csv')
# Creating a subset df of the columns that must be only 0 or 1 (all rows in columns 2 onwards):
subset = data.iloc[:, 2:]
# Find indices:
index = subset[(subset != 0) & (subset != 1)].index
# Remove rows from the original dataset:
data = data.drop(index)
It is giving me back an empty dataframe. PLEASE HELP.

Sample:
data = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 5, 4, 5, 5, 4],
    'D': [1, 0, 1, 0, 1, 0],
    'E': [1, 0, 0, 1, 2, 4],
})
print (data)
A B D E
0 a 4 1 1
1 b 5 0 0
2 c 4 1 0
3 d 5 0 1
4 e 5 1 2
5 f 4 0 4
If you need only rows with 1 and 0 values, use DataFrame.isin with DataFrame.all to test whether all values per row are True:
subset = data.iloc[:,2:]
data3 = data[subset.isin([0,1]).all(axis=1)]
print (data3)
A B D E
0 a 4 1 1
1 b 5 0 0
2 c 4 1 0
3 d 5 0 1
Details:
print (subset.isin([0,1]))
D E
0 True True
1 True True
2 True True
3 True True
4 True False
5 True False
print (subset.isin([0,1]).all(axis=1))
0 True
1 True
2 True
3 True
4 False
5 False
dtype: bool
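Putting the pieces together, a minimal runnable sketch of this approach, using the sample frame above rather than the original DataSet.csv:

```python
import pandas as pd

data = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 5, 4, 5, 5, 4],
    'D': [1, 0, 1, 0, 1, 0],
    'E': [1, 0, 0, 1, 2, 4],
})

# Columns from position 2 onward must contain only 0 or 1
subset = data.iloc[:, 2:]

# Keep only the rows where every checked value is 0 or 1
data3 = data[subset.isin([0, 1]).all(axis=1)]
print(data3)
```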

Your subset is a pd.DataFrame, not a pd.Series. The conditional indexing you are doing for index would work if subset were a Series (i.e. if you were only checking the condition on a single column, not multiple columns).
So having subset as a DataFrame is fine, but it changes how the conditional slice works. My testing shows your index expression returns NaN for the 0s and 1s (rather than leaving them out, as a slice of a Series would). Adding dropna(how='all'), to keep only the rows that still contain at least one offending value, should fix your code:
# Find indices of rows with a value other than 0 or 1:
index = subset[(subset != 0) & (subset != 1)].dropna(how='all').index
# Remove those rows from the original dataset:
data = data.drop(index)

From your code I made an educated guess that you want to compare more than one column.
This should do the trick:
import numpy as np

# Mark elements that are 0 or 1
val = np.isin(subset, np.array([0, 1]))
# Keep a row only if every element in it matched
index = np.prod(val, axis=1) > 0
# Select only the desired rows
data = data[index]
Example
# Data
a b c
0 1 1 1
1 2 2 2
2 3 1 3
3 4 3 3
4 5 3 1
# Removing rows that have elements other than 1 or 2
a b c
0 1 1 1
1 2 2 2
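A runnable version of that example, reconstructing the sample frame shown above:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                     'b': [1, 2, 1, 3, 3],
                     'c': [1, 2, 3, 3, 1]})

# True where the element is 1 or 2
val = np.isin(data, [1, 2])

# A row survives only if every element matched
index = np.prod(val, axis=1) > 0
print(data[index])  # keeps only the first two rows
```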

Without your data from DataSet.csv, I tried to make a guess.
subset[(subset != 0) & (subset != 1)] basically returns the subset dataframe with the values where (subset != 0) & (subset != 1) is False turned into NaN, while the values where it is True are kept unchanged. In other words, this is an element-wise map, not a filter.
Therefore, subset[(subset != 0) & (subset != 1)].index is the whole index of your data dataframe.
You drop all of it, so you get back an empty dataframe.
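A small sketch illustrating this: indexing a DataFrame with a same-shaped boolean mask preserves the frame's shape (masking values to NaN), so .index is always the full index:

```python
import pandas as pd

data = pd.DataFrame({'A': list('abc'), 'D': [1, 0, 1], 'E': [0, 2, 4]})
subset = data.iloc[:, 1:]

masked = subset[(subset != 0) & (subset != 1)]
print(masked)        # 0/1 values become NaN; the shape is unchanged
print(masked.index)  # still the full index: 0, 1, 2
```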


Get row count in DataFrame without for loop

I need to check whether the last value of dataframe['position'] is different from 0, and if so count the previous values (so in reverse) until they change, storing that count in a variable, all without for loops. By loc or iloc, for example...
dataframe:
| position |
0 1
1 0
2 1 <4
3 1 <3
4 1 <2
5 1 <1
count = 4
I achieved this by a for loop, but I need to avoid it:
count = 1
if data['position'].iloc[-1] != 0:
    for i in data['position']:
        if data['position'].iloc[-count] == data['position'].iloc[-1]:
            count = count + 1
        else:
            break
    if data['position'].iloc[-count] != data['position'].iloc[-1]:
        count = count - 1
You can reverse your Series, convert to boolean using the target condition (here "not equal 0" with ne), and apply a cummin to propagate the False upwards and sum to count the trailing True:
count = df.loc[::-1, 'position'].ne(0).cummin().sum()
Output: 4
If you have multiple columns:
counts = df.loc[::-1].ne(0).cummin().sum()
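A self-contained sketch of the cummin approach, using the sample column from the question:

```python
import pandas as pd

df = pd.DataFrame({'position': [1, 0, 1, 1, 1, 1]})

# Reverse, test "not equal 0", propagate the first False with cummin,
# then sum to count the trailing run of Trues
count = df.loc[::-1, 'position'].ne(0).cummin().sum()
print(count)  # 4
```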
Alternative
A slightly faster alternative (~25% faster), relying on the assumptions that you have at least one zero and non-duplicated indices, could be to find the last zero and use indexing:
m = df['position'].eq(0)
count = len(df.loc[m[m].index[-1]:])-1
Without the requirement to have at least one zero:
m = df['position'].eq(0)
m = m[m]
count = len(df) if m.empty else len(df.loc[m.index[-1]:])-1
This should do the trick:
((data.iloc[-1] != 0) & (data[::-1] == data.iloc[-1])).cumprod().sum()
This builds a condition ((data.iloc[-1] != 0) & (data[::-1] == data.iloc[-1])) indicating whether the value in each row (counting backwards from the end) is nonzero and equals the last value. Then, the values are coerced into 0 or 1 and the cumulative product is taken, so that the first non-matching zero will break the sequence and all subsequent values will be zero. Then the flags are summed to get the count of these consecutive matched values.
Depending on your data, though, stepping iteratively backwards from the end may be faster. This solution is vectorized, but it requires working with the entire column of data and doing several computations which are the same size as the original series.
Example:
In [12]: data = pd.DataFrame(np.random.randint(0, 3, size=(10, 5)), columns=list('ABCDE'))
...: data
Out[12]:
A B C D E
0 2 0 1 2 0
1 1 0 1 2 1
2 2 1 2 1 0
3 1 0 1 2 2
4 1 1 0 0 2
5 2 2 1 0 2
6 2 1 1 2 2
7 0 0 0 1 0
8 2 2 0 0 1
9 2 0 0 2 1
In [13]: ((data.iloc[-1] != 0) & (data[::-1] == data.iloc[-1])).cumprod().sum()
Out[13]:
A 2
B 0
C 0
D 1
E 2
dtype: int64

pandas: replace values in column with the last character in the column name

I have a dataframe as follows:
import pandas as pd
df = pd.DataFrame({'sent.1':[0,1,0,1],
                   'sent.2':[0,1,1,0],
                   'sent.3':[0,0,0,1],
                   'sent.4':[1,1,0,1]})
I am trying to replace the non-zero values with the character at index 5 of the column names (which is the numeric part of the column names), so the output should be:
sent.1 sent.2 sent.3 sent.4
0 0 0 0 4
1 1 2 0 4
2 0 2 0 0
3 1 0 3 4
I have tried the following but it does not work,
print(df.replace(1, pd.Series([i[5] for i in df.columns], [i[5] for i in df.columns])))
However, when I replace the values with the full column names instead, the following code works, so I am not sure which part is wrong.
print(df.replace(1, pd.Series(df.columns, df.columns)))
Since you're dealing with 1's and 0's, you can actually just multiply the dataframe by a range:
df = df * range(1, df.shape[1] + 1)
Output:
sent.1 sent.2 sent.3 sent.4
0 0 0 0 4
1 1 2 0 4
2 0 2 0 0
3 1 0 3 4
Or, if you want to take the numbers from the column names:
df = df * df.columns.str.split('.').str[-1].astype(int)
You could use string multiplication on a boolean array to place the characters based on the condition, and where to restore the zeros:
mask = df.ne(0)
(mask*df.columns.str[5]).where(mask, 0)
To have integers:
mask = df.ne(0)
(mask*df.columns.str[5].astype(int))
output:
sent.1 sent.2 sent.3 sent.4
0 0 0 0 4
1 1 2 0 4
2 0 2 0 0
3 1 0 3 4
And another one, working with an arbitrary condition (here s.ne(0)):
df.apply(lambda s: s.mask(s.ne(0), s.name.rpartition('.')[-1]))
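For completeness, a runnable sketch checking that the range-multiplication and column-name approaches agree on the sample frame from the question:

```python
import pandas as pd

df = pd.DataFrame({'sent.1': [0, 1, 0, 1],
                   'sent.2': [0, 1, 1, 0],
                   'sent.3': [0, 0, 0, 1],
                   'sent.4': [1, 1, 0, 1]})

# Multiply the 0/1 flags by the 1-based column position
out1 = df * range(1, df.shape[1] + 1)

# Multiply by the integer parsed from each column name
out2 = df * df.columns.str.split('.').str[-1].astype(int)

print(out1.equals(out2))  # True
```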

How to drop conflicted rows in Dataframe?

I have a classification task, which means conflicts harm the performance, i.e. rows with the same feature but different labels.
idx feature label
0 a 0
1 a 1
2 b 0
3 c 1
4 a 0
5 b 0
How could I get a formatted dataframe as below?
idx feature label
2 b 0
3 c 1
5 b 0
DataFrame.duplicated() only outputs the duplicated rows, and logical operations between df["feature"].duplicated() and df.duplicated() do not seem to return the results I want.
I think you need the rows with only one unique value per group - so use GroupBy.transform with DataFrameGroupBy.nunique, compare with 1, and filter by boolean indexing:
df = df[df.groupby('feature')['label'].transform('nunique').eq(1)]
print (df)
idx feature label
2 2 b 0
3 3 c 1
5 5 b 0
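To see why this works, a sketch printing the intermediate per-group nunique counts on the sample data:

```python
import pandas as pd

df = pd.DataFrame({'idx': [0, 1, 2, 3, 4, 5],
                   'feature': list('aabcab'),
                   'label': [0, 1, 0, 1, 0, 0]})

# Number of distinct labels seen for each row's feature
nun = df.groupby('feature')['label'].transform('nunique')
print(nun.tolist())  # feature 'a' has labels {0, 1} -> 2; 'b' and 'c' -> 1

# Keep only the conflict-free rows
out = df[nun.eq(1)]
print(out)
```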

Delete a row if it doesn't contain a specified integer value (Pandas)

I have a Pandas dataset that I want to clean up prior to applying my ML algorithm. I am wondering if it was possible to remove a row if an element of its columns does not match a set of values. For example, if I have the dataframe:
a b
0 1 6
1 4 7
2 2 4
3 3 7
...
And I desire the values of a to be one of [1,3] and of b to be one of [6,7], such that my final dataset is:
a b
0 1 6
1 3 7
...
Currently, my implementation is not working as some of my data rows have erroneous strings attached to the value. For example, instead of a value of 1 I'll have something like 1abc. Hence why I would like to remove anything that is not an integer of that value.
My workaround is also a bit archaic, as I am removing entries for column a that do not have 1 or 3 via:
dataset = dataset[(dataset.commute != 1)]
dataset = dataset[(dataset.commute != 3)]
You can use boolean indexing with double isin and &:
df1 = df[(df['a'].isin([1,3])) & (df['b'].isin([6,7]))]
print (df1)
a b
0 1 6
3 3 7
Or use numpy.in1d:
df1 = df[(np.in1d(df['a'], [1,3])) & (np.in1d(df['b'], [6,7])) ]
print (df1)
a b
0 1 6
3 3 7
But if you need to remove all rows with non-numeric values, use to_numeric with errors='coerce', which returns NaN for unparseable values, and then filter with notnull:
df = pd.DataFrame({'a':['1abc','2','3'],
                   'b':['4','5','dsws7']})
print (df)
a b
0 1abc 4
1 2 5
2 3 dsws7
mask = pd.to_numeric(df['a'], errors='coerce').notnull() & \
       pd.to_numeric(df['b'], errors='coerce').notnull()
df1 = df[mask].astype(int)
print (df1)
a b
1 2 5
If you need to check whether some value is NaN or None:
df = pd.DataFrame({'a':['1abc',None,'3'],
                   'b':['4','5',np.nan]})
print (df)
a b
0 1abc 4
1 None 5
2 3 NaN
print (df[df.isnull().any(axis=1)])
a b
1 None 5
2 3 NaN
You can use pandas isin()
df = df[df.a.isin([1,3]) & df.b.isin([6,7])]
a b
0 1 6
3 3 7

Drop Rows by Multiple Column Criteria in DataFrame

I have a pandas dataframe that I'm trying to drop rows based on a criteria across select columns. If the values in these select columns are zero, the rows should be dropped. Here is an example.
import pandas as pd
t = pd.DataFrame({'a':[1,0,0,2],'b':[1,2,0,0],'c':[1,2,3,4]})
a b c
0 1 1 1
1 0 2 2
2 0 0 3
3 2 0 4
I would like to try something like:
cols_of_interest = ['a','b'] #Drop rows if zero in all these columns
t = t[t[cols_of_interest]!=0]
This doesn't drop the rows, so I tried:
t = t.drop(t[t[cols_of_interest]==0].index)
And all rows are dropped.
What I would like to end up with is:
a b c
0 1 1 1
1 0 2 2
3 2 0 4
Where the 3rd row (index 2) was dropped because it took on value 0 in BOTH the columns of interest, not just one.
Your problem here is that you assigned the result of your boolean condition, t = t[t[cols_of_interest]!=0], which overwrites your original df and sets the values where the condition is not met to NaN.
What you want to do is generate the boolean mask, then drop the NaN rows, passing thresh=1 so that a row is kept as long as it has at least a single non-NaN value; we can then use loc with the index of this to get the desired df:
In [124]:
cols_of_interest = ['a','b']
t.loc[t[t[cols_of_interest]!=0].dropna(thresh=1).index]
Out[124]:
a b c
0 1 1 1
1 0 2 2
3 2 0 4
EDIT
As pointed out by @DSM, you can achieve this simply by using any with axis=1 to test the condition and using the result to index into your df:
In [125]:
t[(t[cols_of_interest] != 0).any(axis=1)]
Out[125]:
a b c
0 1 1 1
1 0 2 2
3 2 0 4
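A sketch showing the intermediate boolean mask that any(axis=1) produces on the sample frame:

```python
import pandas as pd

t = pd.DataFrame({'a': [1, 0, 0, 2], 'b': [1, 2, 0, 0], 'c': [1, 2, 3, 4]})
cols_of_interest = ['a', 'b']

# True where at least one of the columns of interest is nonzero
mask = (t[cols_of_interest] != 0).any(axis=1)
print(mask.tolist())  # [True, True, False, True]

out = t[mask]
print(out)  # row 2 dropped: both 'a' and 'b' are zero there
```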
