How to find duplicates in pandas dataframe and print them - python

I am checking a pandas dataframe for duplicate rows using the duplicated function, which works well. But how do I print the contents of only the rows that are flagged True?
For example, if I run:
duplicateCheck = dataSet.duplicated(subset=['Name', 'Date',], keep=False)
print(duplicateCheck)
it outputs:
0 False
1 False
2 False
3 False
4 True
5 True
6 False
7 False
8 False
9 False
I'm looking for something like:
for row in duplicateCheck.keys():
    if row == True:
        print(row, duplicateCheck[row])
which would print the rows of the dataframe that are duplicates.

Why not index the dataframe with the boolean Series directly:
duplicateCheck = dataSet.duplicated(subset=['Name', 'Date',], keep=False)
print(dataSet[duplicateCheck])
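
A minimal, self-contained sketch of that answer. Only the column names Name and Date come from the question; the sample rows and the Score column are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; only 'Name' and 'Date' appear in the question.
dataSet = pd.DataFrame({
    "Name": ["Ann", "Bob", "Ann", "Cid"],
    "Date": ["2020-01-01", "2020-01-02", "2020-01-01", "2020-01-03"],
    "Score": [1, 2, 3, 4],
})

# keep=False marks every member of a duplicate group, not just the repeats.
duplicateCheck = dataSet.duplicated(subset=["Name", "Date"], keep=False)

# Indexing with the boolean Series keeps only the rows where it is True.
duplicates = dataSet[duplicateCheck]
print(duplicates)
```

Here rows 0 and 2 share the same Name and Date, so both are printed with all of their columns.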

Related

Annotating maximum by iterating each rows and make new column with resultant output

Can anyone help using pandas in Python, how to get the result?
        text      A      B      C
index
0       Cool  False  False   True
1      Drunk   True  False  False
2      Study  False   True  False
Output:
        text  Result
index
0       Cool   False
1      Drunk   False
2      Study   False
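If a row's sum over the boolean columns is more than half the number of those columns, True is the more common value in that row. Count only the boolean columns: len(df.columns) would also count the text column and set the threshold too high.
Try:
flags = df.drop("text", axis=1)
df["Result"] = flags.sum(axis=1) > flags.shape[1] / 2
output = df[["text", "Result"]]
>>> output
    text  Result
0   Cool   False
1  Drunk   False
2  Study   False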
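
To see the majority rule return True as well, here is a small sketch that adds one made-up row ("Party") to the question's data. It uses the row mean instead of the sum, which is an equivalent formulation and not the answer's exact code: the mean of a boolean row is the fraction of True values.

```python
import pandas as pd

df = pd.DataFrame({
    "text": ["Cool", "Drunk", "Study", "Party"],  # "Party" is a hypothetical extra row
    "A": [False, True, False, True],
    "B": [False, False, True, True],
    "C": [True, False, False, False],
})

# Look only at the boolean columns; the text column must not be counted.
flags = df.drop(columns="text")

# Fraction of True values per row; above 0.5 means True is the majority.
df["Result"] = flags.mean(axis=1) > 0.5

output = df[["text", "Result"]]
print(output)
```

Only "Party" has two True values out of three, so it is the only row with Result set to True.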

How to subset output of pandas contain statement to give all True values?

Code
df_2clean["p2_conf"].astype(str).str.contains(r'[^0-9+-:.\s]')
Output
0 False
1 False
2 False
3 False
4 True
Try this (the boolean Series can be used as the mask directly, so comparing it with True is not needed):
df_2subset = df_2clean[df_2clean["p2_conf"].astype(str).str.contains(r'[^0-9+-:.\s]')]
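
A runnable sketch of the same idea. The values in p2_conf are made up, since the question does not show the data; the pattern is copied from the question unchanged.

```python
import pandas as pd

# Hypothetical column contents; the question only shows the boolean output.
df_2clean = pd.DataFrame({"p2_conf": [0.95, "0.80", "1e-3", "n/a", "12:30"]})

# Pattern from the question: matches any character outside the listed class.
pattern = r"[^0-9+-:.\s]"

# Boolean Series: True where the value contains an unexpected character.
mask = df_2clean["p2_conf"].astype(str).str.contains(pattern)

# Rows flagged True, and (with ~) the rows that passed the check.
df_2subset = df_2clean[mask]
df_2ok = df_2clean[~mask]
print(df_2subset)
```

One thing to be aware of: inside a character class, `+-:` is read as a range from `+` to `:`, which also covers `,` and `/`. If a literal hyphen was intended, it would need to be escaped or placed at the end of the class.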

Create columns in dataframe based on csv field

I have a pandas dataframe with the column "Values" that has comma separated values:
Row|Values
1|1,2,3,8
2|1,4
I want to create columns based on the CSV, and assign a boolean indicating if the row has that value, as follows:
Row|1,2,3,4,8
1|true,true,true,false,true
2|true,false,false,true,false
How can I accomplish that?
Thanks in advance
Just use str.get_dummies to build one 0/1 column per value, then astype(bool) to change 1 to True and 0 to False:
df.set_index('Row')['Values'].str.get_dummies(',').astype(bool)
Out[318]:
        1      2      3      4      8
Row
1    True   True   True  False   True
2    True  False  False   True  False
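
For reference, a self-contained version that builds the question's two-row frame first. The variable name flags is only illustrative.

```python
import pandas as pd

# The two rows from the question.
df = pd.DataFrame({"Row": [1, 2], "Values": ["1,2,3,8", "1,4"]})

# Split each string on ',' and create one indicator column per distinct value,
# then convert the 0/1 indicators to booleans.
flags = df.set_index("Row")["Values"].str.get_dummies(sep=",").astype(bool)
print(flags)
```

Note that the resulting column labels are strings ('1', '2', ...), not integers, because they come from splitting the text.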

How to assert that a pandas data frame filtered on a condition is true

I have a pytest test that checks the results of a query returning a pandas dataframe.
I want to assert that every value in a particular column col contains a given input as a substring.
The expression below gives me the rows (a dataframe) whose col value contains the input. How can I assert that this holds for all rows?
assert result_df[result_df['col'].astype(str).str.contains(category)].bool == True
doesn't work
Try this:
assert result_df['col'].astype(str).str.contains(category).all()
Please refer to the pandas docs for more info: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.all.html
The reason your code doesn't work is that you are testing the filtered dataframe object itself, not whether all of the values in the column match. In addition, .bool without parentheses is a method object, so comparing it with True is always False.
I believe you need Series.all to check whether all values of the boolean Series are True:
assert result_df['col'].astype(str).str.contains(category).all()
Sample:
result_df = pd.DataFrame({
    'col': list('aaabbb')
})
print (result_df)
col
0 a
1 a
2 a
3 b
4 b
5 b
category = 'b'
assert result_df['col'].astype(str).str.contains(category).all()
AssertionError
Detail:
print (result_df['col'].astype(str).str.contains(category))
0 False
1 False
2 False
3 True
4 True
5 True
Name: col, dtype: bool
print (result_df['col'].astype(str).str.contains(category).all())
False
category = 'a|b'
assert result_df['col'].astype(str).str.contains(category).all()
print (result_df['col'].astype(str).str.contains(category))
0 True
1 True
2 True
3 True
4 True
5 True
Name: col, dtype: bool
print (result_df['col'].astype(str).str.contains(category).all())
True
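Found it: assert result_df['col'].astype(str).str.contains(category).all() works (thanks to @jezrael for suggesting all).
Note that the call parentheses are required. assert result_df[result_df['col'].astype(str).str.contains(category)].bool and assert result_df['col'].astype(str).str.contains(category).all without parentheses always pass, because a method object is truthy, so they test nothing.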
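
A sketch of how this might look in a pytest module, using the sample frame from the answer above. The helper all_contain and the test function names are hypothetical, introduced only for this example.

```python
import pandas as pd


def all_contain(df, column, pattern):
    """Hypothetical helper: True if every value in df[column] matches pattern."""
    return bool(df[column].astype(str).str.contains(pattern).all())


result_df = pd.DataFrame({"col": list("aaabbb")})


def test_all_rows_match():
    # Every value is either 'a' or 'b', so the regex 'a|b' matches all rows.
    assert all_contain(result_df, "col", "a|b")


def test_not_all_rows_match():
    # Only half of the rows contain 'b'.
    assert not all_contain(result_df, "col", "b")


# Why the parentheses matter: a method object is truthy regardless of the data.
mask = result_df["col"].str.contains("b")
assert bool(mask.all) is True      # the method itself, never False
assert bool(mask.all()) is False   # the actual result of the check
```

pytest would collect the two test_ functions automatically; the last three lines only demonstrate the difference between the method and its result.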

Extracting all rows from pandas Dataframe that have certain value in a specific column

I am relatively new to Python/pandas and am struggling with extracting the correct data from a pd.DataFrame. What I actually have is a DataFrame with 3 columns:
data =
Position Letter Value
1 a TRUE
2 f FALSE
3 c TRUE
4 d TRUE
5 k FALSE
What I want to do is put all of the TRUE rows into a new DataFrame so that the answer would be:
answer =
Position Letter Value
1 a TRUE
3 c TRUE
4 d TRUE
I know that you can access a particular column using
data['Value']
but how do I extract all of the TRUE rows?
Thanks for any help and advice,
Alex
You can test which Values are True:
In [11]: data['Value'] == True
Out[11]:
0 True
1 False
2 True
3 True
4 False
Name: Value, dtype: bool
and then use boolean indexing to pull out those rows:
In [12]: data[data['Value'] == True]
Out[12]:
Position Letter Value
0 1 a True
2 3 c True
3 4 d True
Note: if the values are actually the strings 'TRUE' and 'FALSE' (they probably shouldn't be!) then use:
data['Value'] == 'TRUE'
You can wrap the value or values you are looking for in a list and do the following:
new_df = df.loc[df['yourColumnName'].isin(['your', 'list', 'items'])]
This will return a new dataframe consisting of the rows of df whose value in that column is one of the list items.
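
Putting both answers together on the question's data, as a sketch. The variable names subset, as_text and answer_text are only illustrative, and the string variant assumes the column holds exactly 'TRUE' or 'FALSE'.

```python
import pandas as pd

data = pd.DataFrame({
    "Position": [1, 2, 3, 4, 5],
    "Letter": list("afcdk"),
    "Value": [True, False, True, True, False],
})

# A boolean column can serve as the mask itself; '== True' is optional.
answer = data[data["Value"]]

# isin keeps the rows whose column value appears in the given list.
subset = data.loc[data["Letter"].isin(["a", "d"])]

# If the column held the strings 'TRUE'/'FALSE' instead of booleans,
# compare against the string (simulated here with a mapped copy).
as_text = data.assign(Value=data["Value"].map({True: "TRUE", False: "FALSE"}))
answer_text = as_text[as_text["Value"] == "TRUE"]

print(answer)
```

All three results keep the original index labels; call reset_index(drop=True) on the result if a fresh 0..n index is wanted.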
