Remove rows where values appear in all columns in Pandas - python

Here is a very simple dataframe:
df = pd.DataFrame({'col1' :[1,2,3],
'col2' :[1,3,3] })
I'm trying to remove rows where there are duplicate values (e.g., row 3)
This doesn't work,
df = df[(df.col1 != 3 & df.col2 != 3)]
and the documentation is pretty clear about why, which makes sense.
But I still don't know how to delete that row.
Does anyone have any ideas? Thanks. Monica.

If I understand your question correctly, I think you were close.
Starting from your data:
In [20]: df
Out[20]:
col1 col2
0 1 1
1 2 3
2 3 3
And doing this:
In [21]: df = df[df['col1'] != df['col2']]
Returns:
In [22]: df
Out[22]:
col1 col2
1 2 3

What about:
In [43]: df = pd.DataFrame({'col1' :[1,2,3],
'col2' :[1,3,3] })
In [44]: df[df.max(axis=1) != df.min(axis=1)]
Out[44]:
col1 col2
1 2 3
[1 rows x 2 columns]
We want to remove rows whose values show up in all columns, or in other words the values are equal => their minimums and maximums are equal. This is method works on a DataFrame with any number of columns. If we apply the above, we remove rows 0 and 2.

Any row with all the same values with have zero as the standard deviation. One way to filter them is as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1' :[1, 2, 3, np.nan],
'col2' :[1, 3, 3, np.nan]}
>>> df.loc[df.std(axis=1, skipna=False) > 0]
col1 col2
1 2

Related

How to switch column values in the same Pandas DataFrame

I have the following DataFrame:
I need to switch values of col2 and col3 with the values of col4 and col5. Values of col1 will remain the same. The end result needs to look as the following:
Is there a way to do this without looping through the DataFrame?
Use rename in pandas
In [160]: df = pd.DataFrame({'A':[1,2,3],'B':[3,4,5]})
In [161]: df
Out[161]:
A B
0 1 3
1 2 4
2 3 5
In [167]: df.rename({'B':'A','A':'B'},axis=1)
Out[167]:
B A
0 1 3
1 2 4
2 3 5
This should do:
og_cols = df.columns
new_cols = [df.columns[0], *df.columns[3:], *df.columns[1:3]]
df = df[new_cols] # Sort columns in the desired order
df.columns = og_cols # Use original column names
If you want to swap the column values:
df.iloc[:, 1:3], df.iloc[:, 3:] = df.iloc[:,3:].to_numpy(copy=True), df.iloc[:,1:3].to_numpy(copy=True)
Pandas reindex could help :
cols = df.columns
#reposition the columns
df = df.reindex(columns=['col1','col4','col5','col2','col3'])
#pass in new names
df.columns = cols

Element-wise Comparison of Two Pandas Dataframes

I am trying to compare two columns in pandas. I know I can do:
# either using Pandas' equals()
df1[col].equals(df2[col])
# or this
df1[col] == df2[col]
However, what I am looking for is to compare these columns elment-wise and when they are not matching print out both values. I have tried:
if df1[col] != df2[col]:
print(df1[col])
print(df2[col])
where I get the error for 'The truth value of a Series is ambiguous'
I believe this is because the column is treated as a series of boolean values for the comparison which causes the ambiguity. I also tried various forms of for loops which did not resolve the issue.
Can anyone point me to how I should go about doing what I described?
This might work for you:
import pandas as pd
df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'col1': [1, 2, 9, 4, 7]})
if not df2[df2['col1'] != df1['col1']].empty:
print(df1[df1['col1'] != df2['col1']])
print(df2[df2['col1'] != df1['col1']])
Output:
col1
2 3
4 5
col1
2 9
4 7
You need to get hold of the index where the column values are not matching. Once you have that index then you can query the individual DFs to get the values.
Please try the fallowing and is if this helps:
for ind in (df1.loc[df1['col1'] != df2['col1']].index):
x = df1.loc[df1.index == ind, 'col1'].values[0]
y = df2.loc[df2.index == ind, 'col1'].values[0]
print(x, y )
Solution
Try this. You could use any of the following one-line solutions.
# Option-1
df.loc[df.apply(lambda row: row[col1] != row[col2], axis=1), [col1, col2]]
# Option-2
df.loc[df[col1]!=df[col2], [col1, col2]]
Logic:
Option-1: We use pandas.DataFrame.apply() to evaluate the target columns row by row and pass the returned indices to df.loc[indices, [col1, col2]] and that returns the required set of rows where col1 != col2.
Option-2: We get the indices with df[col1] != df[col2] and the rest of the logic is the same as Option-1.
Dummy Data
I made the dummy data such that for indices: 2,6,8 we will find column 'a' and 'c' to be different. Thus, we want only those rows returned by the solution.
import numpy as np
import pandas as pd
a = np.arange(10)
c = a.copy()
c[[2,6,8]] = [0,20,40]
df = pd.DataFrame({'a': a, 'b': a**2, 'c': c})
print(df)
Output:
a b c
0 0 0 0
1 1 1 1
2 2 4 0
3 3 9 3
4 4 16 4
5 5 25 5
6 6 36 20
7 7 49 7
8 8 64 40
9 9 81 9
Applying the solution to the dummy data
We see that the solution proposed returns the result as expected.
col1, col2 = 'a', 'c'
result = df.loc[df.apply(lambda row: row[col1] != row[col2], axis=1), [col1, col2]]
print(result)
Output:
a c
2 2 0
6 6 20
8 8 40

Count across dataframe columns based on str.contains (or similar)

I would like to count the number of cells within each row that contain a particular character string, cells which have the particular string more than once should be counted once only.
I can count the number of cells across a row which equal a given value, but when I expand this logic to use str.contains, I have issues, as shown below
d = {'col1': ["a#", "b","c#"], 'col2': ["a", "b","c#"]}
df = pd.DataFrame(d)
#can correctly count across rows using equality
thisworks =( df =="a#" ).sum(axis=1)
#can count across a column using str.contains
thisworks1=df['col1'].str.contains('#').sum()
#but cannot use str.contains with a dataframe so what is the alternative
thisdoesnt =( df.str.contains('#') ).sum(axis=1)
Output should be a series showing the number of cells in each row that contain the given character string.
str.contains is a series method. To apply it to whole dataframe you need either agg or apply such as:
df.agg(lambda x: x.str.contains('#')).sum(1)
Out[2358]:
0 1
1 0
2 2
dtype: int64
If you don't like agg nor apply, you may use np.char.find to work directly on underlying numpy array of df
(np.char.find(df.values.tolist(), '#') + 1).astype(bool).sum(1)
Out[2360]: array([1, 0, 2])
Passing it to series or a columns of df
pd.Series((np.char.find(df.values.tolist(), '#') + 1).astype(bool).sum(1), index=df.index)
Out[2361]:
0 1
1 0
2 2
dtype: int32
A solution using df.apply:
df = pd.DataFrame({'col1': ["a#", "b","c#"],
'col2': ["a", "b","c#"]})
df
col1 col2
0 a# a
1 b b
2 c# c#
df['sum'] = df.apply(lambda x: x.str.contains('#'), axis=1).sum(axis=1)
col1 col2 sum
0 a# a 1
1 b b 0
2 c# c# 2
Something like this should work:
df = pd.DataFrame({'col1': ['#', '0'], 'col2': ['#', '#']})
df['totals'] = df['col1'].str.contains('#', regex=False).astype(int) +\
df['col2'].str.contains('#', regex=False).astype(int)
df
# col1 col2 totals
# 0 # # 2
# 1 0 # 1
It should generalize to as many columns as you want.

update dataframe with series

having a dataframe, I want to update subset of columns with a series of same length as number of columns being updated:
>>> df = pd.DataFrame(np.random.randint(0,5,(6, 2)), columns=['col1','col2'])
>>> df
col1 col2
0 1 0
1 2 4
2 4 4
3 4 0
4 0 0
5 3 1
>>> df.loc[:,['col1','col2']] = pd.Series([0,1])
...
ValueError: shape mismatch: value array of shape (6,) could not be broadcast to indexing result of shape (2,6)
it fails, however, I am able to do the same thing using list:
>>> df.loc[:,['col1','col2']] = list(pd.Series([0,1]))
>>> df
col1 col2
0 0 1
1 0 1
2 0 1
3 0 1
4 0 1
5 0 1
could you please help me to understand, why updating with series fails? do I have to perform some particular reshaping?
When assigning with a pandas object, pandas treats the assignment more "rigorously". A pandas to pandas assignment must pass stricter protocols. Only when you turn it to a list (or equivalently pd.Series([0, 1]).values) did pandas give in and allow you to assign in the way you'd imagine it should work.
That higher standard of assignment requires that the indices line up as well, so even if you had the right shape, it still wouldn't have worked without the correct indices.
df.loc[:, ['col1', 'col2']] = pd.DataFrame([[0, 1] for _ in range(6)])
df
df.loc[:, ['col1', 'col2']] = pd.DataFrame([[0, 1] for _ in range(6)], columns=['col1', 'col2'])
df

Filter rows of pandas dataframe whose values are lower than 0

I have a pandas dataframe like this
df = pd.DataFrame(data=[[21, 1],[32, -4],[-4, 14],[3, 17],[-7,NaN]], columns=['a', 'b'])
df
I want to be able to remove all rows with negative values in a list of columns and conserving rows with NaN.
In my example there is only 2 columns, but I have more in my dataset, so I can't do it one by one.
If you want to apply it to all columns, do df[df > 0] with dropna():
>>> df[df > 0].dropna()
a b
0 21 1
3 3 17
If you know what columns to apply it to, then do for only those cols with df[df[cols] > 0]:
>>> cols = ['b']
>>> df[cols] = df[df[cols] > 0][cols]
>>> df.dropna()
a b
0 21 1
2 -4 14
3 3 17
I've found you can simplify the answer by just doing this:
>>> cols = ['b']
>>> df = df[df[cols] > 0]
dropna() is not an in-place method, so you have to store the result.
>>> df = df.dropna()
I was looking for a solution for this that doesn't change the dtype (which will happen if NaN's are mixed in with ints as suggested in the answers that use dropna. Since the questioner already had a NaN in their data, that may not be an issue for them. I went with this solution which preserves the int64 dtype. Here it is with my sample data:
df = pd.DataFrame(data={'a':[0, 1, 2], 'b': [-1,0,1], 'c': [-2, -1, 0]})
columns = ['b', 'c']
filter_ = (df[columns] >= 0).all(axis=1)
df[filter_]
a b c
2 2 1 0

Categories