How to drop conflicting rows in a DataFrame? - python

I have a classification task, so conflicting rows (the same feature but a different label) harm performance:
idx  feature  label
  0        a      0
  1        a      1
  2        b      0
  3        c      1
  4        a      0
  5        b      0
How could I get the formatted dataframe below?
idx  feature  label
  2        b      0
  3        c      1
  5        b      0
DataFrame.duplicated() only flags the duplicated rows, and it seems that combining df["feature"].duplicated() and df.duplicated() with logical operations does not return the result I want.

I think you need the rows whose group has only one unique label, so use GroupBy.transform with DataFrameGroupBy.nunique, compare with 1, and filter with boolean indexing:
df = df[df.groupby('feature')['label'].transform('nunique').eq(1)]
print(df)
   idx feature  label
2    2       b      0
3    3       c      1
5    5       b      0
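A minimal runnable sketch of this approach, reconstructing the example data from the question:
import pandas as pd

# Example data from the question
df = pd.DataFrame({'idx': [0, 1, 2, 3, 4, 5],
                   'feature': ['a', 'a', 'b', 'c', 'a', 'b'],
                   'label': [0, 1, 0, 1, 0, 0]})

# Keep only rows whose feature group contains exactly one distinct label
df = df[df.groupby('feature')['label'].transform('nunique').eq(1)]
print(df)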

Related

Pandas Counting Each Column with its Specific Thresholds

If I have the following dataframe:
    A  B  C  D  E
1   1  2  0  1  0
2   0  0  0  1 -1
3   1  1  3 -5  2
4  -3  4  2  6  0
5   2  4  1  9 -1
T   1  2  2  4  1
The last row holds my threshold values for each column. I want to count, for each column in python pandas, how many values are lower than that column's threshold.
Desired output is:
        A  B  C  D  E
Count   2  2  3  3  4
But I need a general solution, not one written for these specific columns, because I have a large dataset and cannot spell out each column name in the code.
Could you please help me with this?
Select all rows except the last by positional indexing and compare them with the last row using DataFrame.lt, then sum, convert the resulting Series to a one-row DataFrame with Series.to_frame, and transpose with DataFrame.T:
df = df.iloc[:-1].lt(df.iloc[-1]).sum().to_frame('count').T
print(df)
       A  B  C  D  E
count  2  2  3  3  4
NumPy alternative with the DataFrame constructor:
arr = df.values
df = pd.DataFrame([np.sum(arr[:-1] < arr[-1], axis=0)], columns=df.columns, index=['count'])
print(df)
       A  B  C  D  E
count  2  2  3  3  4
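For reference, a self-contained sketch of both variants, with the data reconstructed from the example above (threshold row labelled 'T'):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 0, 1, -3, 2, 1],
                   'B': [2, 0, 1, 4, 4, 2],
                   'C': [0, 0, 3, 2, 1, 2],
                   'D': [1, 1, -5, 6, 9, 4],
                   'E': [0, -1, 2, 0, -1, 1]},
                  index=[1, 2, 3, 4, 5, 'T'])

# pandas variant: compare every data row with the last (threshold) row
counts = df.iloc[:-1].lt(df.iloc[-1]).sum().to_frame('count').T
print(counts)

# NumPy variant
arr = df.values
counts = pd.DataFrame([np.sum(arr[:-1] < arr[-1], axis=0)],
                      columns=df.columns, index=['count'])
print(counts)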

Is there an elegant way to remap group values into an incremental series in a pandas DataFrame?

Is there an elegant way to reassign group values to increasing ones?
I have a table which is already in order:
X = pandas.DataFrame([['a',2],['b',4],['ba',4],['c',8]],columns=['value','group'])
X
Out[18]:
  value  group
0     a      2
1     b      4
2    ba      4
3     c      8
But I would like to remap the group values so that they increase one by one. The end result would look like:
  value  group
0     a      1
1     b      2
2    ba      2
3     c      3
Using category or factorize
X.group.astype('category').cat.codes+1 # pd.factorize(X.group)[0]+1
Out[105]:
0    1
1    2
2    2
3    3
dtype: int8
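To write the new codes back into the frame, a small usage sketch (this assumes, as in the question, that the table is already sorted by group, so factorize assigns codes in order of appearance):
X['group'] = pd.factorize(X['group'])[0] + 1
print(X)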

pandas: Grouping or filtering based on values in list, instead of dataframe

I want to get a row count of the frequency of each value, even if that value doesn't exist in the dataframe.
d = {'light' : pd.Series(['b','b','c','a','a','a','a'], index=[1,2,3,4,5,6,9]),'injury' : pd.Series([1,5,5,5,2,2,4], index=[1,2,3,4,5,6,9])}
testdf = pd.DataFrame(d)
   injury light
1       1     b
2       5     b
3       5     c
4       5     a
5       2     a
6       2     a
9       4     a
I want to get a count of the number of occurrences of each unique value of 'injury' for each unique value in 'light'.
Normally I would just use groupby(), or (in this case, since I want it to be in a specific format), pivot_table:
testdf.reset_index().pivot_table(index='light',columns='injury',fill_value=0,aggfunc='count')
        index
injury      1  2  4  5
light
a           0  2  1  1
b           1  0  0  1
c           0  0  0  1
But in this case I actually want to compare the records in the dataframe to an external list of values-- in this case, ['a','b','c','d']. So if 'd' doesn't exist in this dataframe, then I want it to return a count of zero:
        index
injury      1  2  4  5
light
a           0  2  1  1
b           1  0  0  1
c           0  0  0  1
d           0  0  0  0
The closest I've come is filtering the dataframe based on each value, and then getting the size of that dataframe:
for v in sorted(['a','b','c','d']):
    idx2 = (testdf['light'].isin([v]))
    df2 = testdf[idx2]
    print(df2.shape[0])
4
2
1
0
But that only returns counts from the 'light' column-- instead of a cross-tabulation of both columns.
Is there a way to make a pivot table, or a groupby() object, that groups things based on values in a list, rather than in a column in a dataframe? Or is there a better way to do this?
Try this:
df = pd.crosstab(testdf.light, testdf.injury, margins=True)
df
injury  1  2  4  5  All
light
a       0  2  1  1    4
b       1  0  0  1    2
c       0  0  0  1    1
All     1  2  1  3    7
df["All"]
light
a 4
b 2
c 1
All 7
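To also get a zero row for values that appear only in the external list (here 'd'), one possible way is to reindex the crosstab result with that list; a small sketch, assuming testdf as defined in the question:
wanted = ['a', 'b', 'c', 'd']
counts = pd.crosstab(testdf.light, testdf.injury).reindex(wanted, fill_value=0)
print(counts)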

Pandas: Combination of two data frames

I have two data frames, old and new. Both have identical columns.
I want to, by the index,
Add rows to old that exist in new but not in old
Update rows of old with data from new.
Is there any efficient way of doing so in pandas? I found update(), which does exactly the second step. However, it doesn't add rows. I could do the first step using some set logic on the indices, but that does not appear to be efficient. What's the best way to do these two operations?
Example
old
   a  b
0  1  1
1  3  3
new
   a  b
1  1  2
2  1  2
result
   a  b
0  1  1
1  1  2
2  1  2
You could first find the indices common to both dataframes, assign the values of the second dataframe to the first at those indices, and then get the result with combine_first:
In [35]: df1
Out[35]:
   a  b
0  1  1
1  3  3
In [36]: df2
Out[36]:
   a  b
1  1  2
2  1  2
idx = df1.index.intersection(df2.index)
df1.loc[idx, :] = df2.loc[idx, :]
df1 = df1.combine_first(df2)
In [39]: df1
Out[39]:
   a  b
0  1  1
1  1  2
2  1  2
You can do the first step using the df.reindex() method, reindexing old by the union of both indices so that indices that exist only in new are added:
old = old.reindex(index=old.index.union(new.index))
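Putting the first answer together, a minimal runnable sketch with the example frames from the question (df1 standing in for old, df2 for new):
import pandas as pd

df1 = pd.DataFrame({'a': [1, 3], 'b': [1, 3]}, index=[0, 1])  # old
df2 = pd.DataFrame({'a': [1, 1], 'b': [2, 2]}, index=[1, 2])  # new

# Update the rows of df1 that also exist in df2 ...
idx = df1.index.intersection(df2.index)
df1.loc[idx, :] = df2.loc[idx, :]

# ... then add the rows that exist only in df2
result = df1.combine_first(df2)
print(result)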

Drop Rows by Multiple Column Criteria in DataFrame

I have a pandas dataframe from which I'm trying to drop rows based on criteria across select columns. If the values in these select columns are zero, the rows should be dropped. Here is an example.
import pandas as pd
t = pd.DataFrame({'a':[1,0,0,2],'b':[1,2,0,0],'c':[1,2,3,4]})
   a  b  c
0  1  1  1
1  0  2  2
2  0  0  3
3  2  0  4
I would like to try something like:
cols_of_interest = ['a','b'] #Drop rows if zero in all these columns
t = t[t[cols_of_interest]!=0]
This doesn't drop the rows, so I tried:
t = t.drop(t[t[cols_of_interest]==0].index)
And all rows are dropped.
What I would like to end up with is:
a b c
0 1 1 1
1 0 2 2
3 2 0 4
Where the 3rd row (index 2) was dropped because it took on value 0 in BOTH the columns of interest, not just one.
Your problem here is that you first assigned the result of your boolean condition, t = t[t[cols_of_interest]!=0], which overwrites your original df and sets the positions where the condition is not met to NaN.
What you want to do is generate the boolean mask, then drop the NaN rows, passing thresh=1 so that there must be at least a single non-NaN value in the row; we can then use loc with the resulting index to get the desired df:
In [124]:
cols_of_interest = ['a','b']
t.loc[t[t[cols_of_interest]!=0].dropna(thresh=1).index]
Out[124]:
   a  b  c
0  1  1  1
1  0  2  2
3  2  0  4
EDIT
As pointed out by @DSM, you can achieve this simply by using any with axis=1 to test the condition, and use the result to index into your df:
In [125]:
t[(t[cols_of_interest] != 0).any(axis=1)]
Out[125]:
   a  b  c
0  1  1  1
1  0  2  2
3  2  0  4
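For completeness, a small runnable sketch of the any(axis=1) approach using the frame defined in the question:
import pandas as pd

t = pd.DataFrame({'a': [1, 0, 0, 2], 'b': [1, 2, 0, 0], 'c': [1, 2, 3, 4]})
cols_of_interest = ['a', 'b']

# Keep rows where at least one of the columns of interest is non-zero
t = t[(t[cols_of_interest] != 0).any(axis=1)]
print(t)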
