Viewing duplicated rows in Pandas - python

I know that if I have a DataFrame in Pandas, I can find out whether each row is a duplicate by using the .duplicated() method, which returns a boolean Series that is True for rows that duplicate an earlier one. My question is: is it then possible to index the original DataFrame with this Series, so that I get back only the duplicates (and can visually inspect them)?

In [17]: import numpy as np; import pandas as pd
In [18]: df = pd.DataFrame(np.random.randint(0, 2, (10, 4)))
In [19]: df
Out[19]:
   0  1  2  3
0  0  1  1  0
1  0  1  1  1
2  0  1  1  1
3  1  1  0  0
4  0  1  0  1
5  1  0  1  0
6  0  1  0  1
7  1  1  1  0
8  0  1  1  0
9  0  0  0  1
[10 rows x 4 columns]
In [20]: df[df.duplicated()]
Out[20]:
   0  1  2  3
2  0  1  1  1
6  0  1  0  1
8  0  1  1  0
[3 rows x 4 columns]
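If you also want to see the first occurrence of each duplicated row next to its later copies, duplicated() accepts keep=False, which flags every member of a duplicated group. With the df above, that should give:
In [21]: df[df.duplicated(keep=False)]
Out[21]:
   0  1  2  3
0  0  1  1  0
1  0  1  1  1
2  0  1  1  1
4  0  1  0  1
6  0  1  0  1
8  0  1  1  0
[6 rows x 4 columns]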

Related

Overwrite data frame value

I have two data frames, df and ddff. The df data frame has 3 rows and 5 columns:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[0,1,0,0,1], [1,0,0,1,0], [0,1,1,0,0]]))
df
   0  1  2  3  4
0  0  1  0  0  1
1  1  0  0  1  0
2  0  1  1  0  0
The ddff data frame lists the neighbour columns of each particular column; it has 5 rows and 3 columns, and each value in ddff is a column label of df:
ddff = pd.DataFrame(np.array([[3,2,1], [4,2,3], [3,1,4], [4,1,2], [2,3,1]]))
ddff
   0  1  2
0  3  2  1
1  4  2  3
2  3  1  4
3  4  1  2
4  2  3  1
Now I need a final data frame in which the first row of df has its neighbour columns (the ones listed in the first row of ddff) set to 1 and all other columns set to 0, overwriting the previous values.
Expected output:
   0  1  2  3  4
0  0  1  1  1  0
1  1  0  0  1  0
2  0  1  1  0  0
You can take the relevant column labels from the first row of ddff, set the values in those columns of the first row to 1, and set the values in the remaining columns to 0:
relevant_columns = ddff.loc[0]
df.loc[0,relevant_columns] = 1
df.loc[0,df.columns[~df.columns.isin(relevant_columns)]] = 0
Output:
   0  1  2  3  4
0  0  1  1  1  0
1  1  0  0  1  0
2  0  1  1  0  0
You can use:
s = ddff.loc[0].values
df.loc[0] = np.where(df.loc[[0]].columns.isin(s),1,0)
>>> df
   0  1  2  3  4
0  0  1  1  1  0
1  1  0  0  1  0
2  0  1  1  0  0
Breaking it down:
>>> np.where(df.loc[[0]].columns.isin(s),1,0)
array([0, 1, 1, 1, 0])
# Before the update
>>> df.loc[0]
0    0
1    1
2    0
3    0
4    1
# After the assignment back
>>> df.loc[0]
0    0
1    1
2    1
3    1
4    0
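As another angle, the same first-row update can be written with np.isin (a sketch, assuming the same df and ddff as above):
s = ddff.loc[0]                                 # neighbour column labels: 3, 2, 1
df.loc[0] = np.isin(df.columns, s).astype(int)  # 1 where the column is a neighbour, else 0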

Pandas: sort according to a row

I have a DataFrame like this (with labels on rows and columns):
    0  1  2  3
 0  1  1  0  0
 1  0  1  1  0
 2  1  0  1  0
-1  5  6  3  2
I would like to order the columns according to the last row (and then drop the row):
   0  1  2  3
0  1  1  0  0
1  1  0  1  0
2  0  1  1  0
Try np.argsort to get the order, then iloc to rearrange the columns and drop the last row:
df.iloc[:-1, np.argsort(-df.iloc[-1])]
Output:
   1  0  2  3
0  1  1  0  0
1  1  0  1  0
2  0  1  1  0
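If you prefer to stay inside pandas, an equivalent sketch uses sort_values to sort the columns by the row labelled -1 in descending order, then drops that row:
df.sort_values(by=-1, axis=1, ascending=False).iloc[:-1]
Note that both approaches keep the original column labels (1 0 2 3); if you want them renumbered as in the expected output, reset them with df.columns = range(df.shape[1]).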

Using previous row value while creating a new column

I have a df in python that looks something like this:
'A'
0
1
0
0
1
1
1
1
0
I want to create another column that adds cumulative 1's from column A, and starts over if the value in column A becomes 0 again. So desired output:
'A'  'B'
 0    0
 1    1
 0    0
 0    0
 1    1
 1    2
 1    3
 1    4
 0    0
This is what I am trying, but it's just replicating column A:
df.B[df.A ==0] = 0
df.B[df.A !=0] = df.A + df.B.shift(1)
You can combine cumsum (to label each streak) with groupby's cumcount:
df['B']=(df.groupby(df.A.eq(0).cumsum()).cumcount()).where(df.A==1,0)
Out[81]:
0    0
1    1
2    0
3    0
4    1
5    2
6    3
7    4
8    0
dtype: int64
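For reference, the grouper df.A.eq(0).cumsum() increments at every zero, so each run of ones shares its group with the zero just before it; cumcount then gives that zero position 0 and the ones 1, 2, 3, ..., and where(df.A==1, 0) zeroes everything where A is 0. A quick sketch of the grouper on the sample data:
df.A.eq(0).cumsum()   # -> 1 1 2 3 3 3 3 3 4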
Use shift with ne and groupby.cumsum:
df['B'] = df.groupby(df['A'].shift().ne(df['A']).cumsum())['A'].cumsum()
print(df)
   A  B
0  0  0
1  1  1
2  0  0
3  0  0
4  1  1
5  1  2
6  1  3
7  1  4
8  0  0
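To make that grouping easier to follow, here is a sketch of the intermediate values (same df as above):
groups = df['A'].shift().ne(df['A']).cumsum()  # new group at every value change
# groups: 1 2 3 3 4 4 4 4 5
df['B'] = df.groupby(groups)['A'].cumsum()     # running total of A within each streak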

Get dataframe rows where every sequence of T columns contains 1 at least N times

How can I get the rows of a DataFrame in which every sequence of 5 consecutive columns contains the number one at least 3 times?
The dataframe is filled with 1's and 0's (no missing values).
Also, a fast approach would be helpful, since I need to check millions of rows and tens of columns.
Create a rolling sum with a width of 5, look at all columns from the 5th to the end, and select the rows where those values are always 3 or above:
rolling_sum = df.rolling(5, min_periods=1, axis=1).sum()
select = (rolling_sum.iloc[:, 4:] >= 3).all(axis=1)
In [92]: df
Out[92]:
   0  1  2  3  4  5  6  7  8  9
0  0  0  0  0  0  0  0  0  0  0
1  0  1  0  0  1  0  1  1  0  0
2  0  1  0  1  1  0  0  1  0  0
3  0  1  1  1  0  1  1  1  1  1
4  0  1  0  1  1  1  0  0  1  1
5  0  0  1  1  1  0  1  1  1  0
In [94]: (df.rolling(5, min_periods=1, axis=1).sum().iloc[:, 4:] >= 3).all(axis=1)
Out[94]:
0    False
1    False
2    False
3     True
4     True
5     True
dtype: bool
Reshape the underlying array data to 3D so that the last axis holds the blocks of 5 columns each, sum along that axis to get the per-block sums, and finally apply an any reduction along the second axis, which represents each row of the original dataframe -
df['result'] = (df.values.reshape(-1,df.shape[1]//5,5).sum(2)>=3).any(1)
For performance, you might want to work with the boolean array df.values == 1 instead of df.values.
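A sketch of that variant (boolean arrays sum like 0/1 integers, so the comparison logic is unchanged):
df['result'] = ((df.values == 1).reshape(-1, df.shape[1]//5, 5).sum(2) >= 3).any(1)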
Sample run -
In [41]: df
Out[41]:
   0  1  2  3  4  5  6  7  8  9
0  0  1  1  0  0  1  0  0  0  1
1  0  0  0  0  0  0  1  0  1  1
2  0  1  1  0  0  1  1  0  0  1
3  1  1  1  1  0  0  0  1  0  1
4  0  1  1  1  0  1  1  1  1  0
5  0  0  0  0  1  0  0  1  1  1
6  0  0  1  0  1  1  0  0  0  1
In [42]: df['result'] = (df.values.reshape(-1,df.shape[1]//5,5).sum(2)>=3).any(1)
In [43]: df
Out[43]:
   0  1  2  3  4  5  6  7  8  9  result
0  0  1  1  0  0  1  0  0  0  1   False
1  0  0  0  0  0  0  1  0  1  1    True
2  0  1  1  0  0  1  1  0  0  1    True
3  1  1  1  1  0  0  0  1  0  1    True
4  0  1  1  1  0  1  1  1  1  0    True
5  0  0  0  0  1  0  0  1  1  1    True
6  0  0  1  0  1  1  0  0  0  1   False
If the number of cols isn't a multiple of 5, we can use np.add.reduceat -
idx = np.arange(0,df.shape[1],5)
df['result'] = (np.add.reduceat(df.values, idx, axis=1)>=3).any(1)
Timings on a million rows and tens of columns -
In [99]: np.random.seed(0)
...: a = (np.random.rand(1000000,20)>0.6).astype(int)
...: df = pd.DataFrame(a)
# Solution from this post
In [101]: %timeit (df.values.reshape(-1,df.shape[1]//5,5).sum(2)>=3).any(1)
10 loops, best of 3: 65.3 ms per loop
# #w-m's soln
In [102]: %timeit (df.rolling(5, min_periods=1, axis=1).sum().iloc[:, 4:] >= 3).all(axis=1)
1 loop, best of 3: 8.04 s per loop
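As a side note, on NumPy 1.20+ the rolling-window interpretation can also be vectorized with sliding_window_view, which should be much faster than pandas' rolling here (a sketch matching the semantics of the first answer):
from numpy.lib.stride_tricks import sliding_window_view
windows = sliding_window_view(df.values, 5, axis=1)  # shape: (rows, n_cols - 4, 5)
select = (windows.sum(-1) >= 3).all(axis=1)          # every window holds >= 3 ones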

Problems with pandas and numpy where condition/multiple values?

I have the following pandas dataframe:
A  B
1  3
0  3
1  2
0  1
0  0
1  4
....
0  0
I would like to add a new column on the right side, following this condition: if the value in B is 3 or 2, put a 1 in new_col, for instance:
(*)
A  B  new_col
1  3  1
0  3  1
1  2  1
0  1  0
0  0  0
1  4  0
....
0  0  0
So I tried the following:
df['new_col'] = np.where(df['B'] == 3 & 2,'1','0')
However, it did not work:
A  B  new_col
1  3  0
0  3  0
1  2  1
0  1  0
0  0  0
1  4  0
....
0  0  0
Any idea how to write a multiple-condition statement with pandas and numpy like (*)?
(Your attempt fails because 3 & 2 is evaluated first as a bitwise AND, which equals 2, so the condition silently became df['B'] == 2.) You can use Pandas' isin, which returns a boolean Series showing whether each element of column 'B' is one of the values you're looking for.
df['new_col'] = df['B'].isin([3, 2])
   A  B  new_col
0  1  3     True
1  0  3     True
2  1  2     True
3  0  1    False
4  0  0    False
5  1  4    False
Then you can use astype to convert the boolean values to 0 and 1, True becoming 1 and False becoming 0:
df['new_col'] = df['B'].isin([3, 2]).astype(int)
Output:
   A  B  new_col
0  1  3        1
1  0  3        1
2  1  2        1
3  0  1        0
4  0  0        0
5  1  4        0
Using numpy:
>>> df['new_col'] = np.where(np.logical_or(df['B'] == 3, df['B'] == 2), '1','0')
>>> df
   A  B new_col
0  1  3       1
1  0  3       1
2  1  2       1
3  0  1       0
4  0  0       0
5  1  4       0
df['new_col'] = [1 if x in [2, 3] else 0 for x in df.B]
The operators *, +, and ^ work on booleans as expected, and mixing booleans with integers gives the expected result. So you can also do:
df['new_col'] = [(x in [2, 3]) * 1 for x in df.B]
Using numpy:
df['new'] = (df.B.values[:, None] == np.array([2, 3])).any(1) * 1
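Broken down, the broadcasting compares every value of B against both targets at once (a sketch):
comp = df.B.values[:, None] == np.array([2, 3])  # shape (n, 2): each row compared to 2 and 3
df['new'] = comp.any(1) * 1                      # True if either matched, then cast to 0/1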
Timing (benchmark plots omitted): compared over the given data set and over 60,000 rows.
df = pd.DataFrame({'A': [1,0,1,0,0,1], 'B': [3,3,2,1,0,4]})
print(df)
df['C'] = [1 if vals == 2 or vals == 3 else 0 for vals in df['B']]
print(df)
   A  B
0  1  3
1  0  3
2  1  2
3  0  1
4  0  0
5  1  4
   A  B  C
0  1  3  1
1  0  3  1
2  1  2  1
3  0  1  0
4  0  0  0
5  1  4  0
