Finding duplicates across multiple groups in a column - Python

I need help with some code.
I have a dataframe in which I want to find duplicates across two columns: Material Part Number and Manufacturer. Both columns contain null values. The way I need to find duplicates is as follows.
I first check the Part Number column for rows with non-null values, because I do not want null values to be treated as duplicates.
If two rows have the same part number, I then check the Manufacturer column for duplicates.
If both the manufacturer and the part number are the same for two or more rows, I write the result into a new column called Level of Duplicacy. The output is 'High' for all rows in which the part numbers and the manufacturers match exactly.
However, if the part numbers match but the manufacturers don't, the output is 'Moderate'.
If the part number itself doesn't match, the Level of Duplicacy is 'No Duplicate'.
Also, for rows with NA in the part number, set the Level of Duplicacy to 'No Duplicate'; for rows with NA in the manufacturer (but a duplicated part number), set it to 'Moderate'.
This is my input table
The code I have written for the same is:

for i in range(len(df)):
    if pd.isnull(df.loc[i, 'Material Part Number']) == False:
        if (df['Material Part Number'].duplicated(keep=False))[i] == True:
            if pd.isnull(df.loc[i, 'Manufacturer']) == False:
                if (df['Manufacturer'].duplicated(keep=False))[i] == True:
                    df.loc[i, 'Level of Duplicacy'] = 'High'
                else:
                    df.loc[i, 'Level of Duplicacy'] = 'Moderate'
            else:
                df.loc[i, 'Level of Duplicacy'] = 'Moderate'
        else:
            df.loc[i, 'Level of Duplicacy'] = 'Not duplicate'
    else:
        df.loc[i, 'Level of Duplicacy'] = 'Not duplicate'
The output I need is
The output I'm getting is
As you can see in the rows highlighted in yellow, my code isn't comparing manufacturers within one particular part number; it's comparing them across all part numbers, and I don't want that. I know that .duplicated() works over the entire column, but what if I want it to compare within each unique part number and then find a match? More of a groupby with duplicated? Can one of you help me modify the code I have written?
Thanks a lot.

Running a loop through the dataframe requires an element-wise comparison of each item. I would suggest using vectorized pandas operations instead. Have a look below; this might be helpful. (Note the .copy() when slicing, so the per-group assignments don't hit a view, and the write-back with .loc on the slice's index.)

df["Level of Duplicacy"] = "Not Duplicate"
Partdups = df.loc[df["Material Part Number"].duplicated(), "Material Part Number"].unique()
for dup in Partdups:
    Nums = df.loc[df["Material Part Number"] == dup, :].copy()
    dupNums = Nums.loc[Nums["Manufacturer"].duplicated(), "Manufacturer"].unique()
    for num in dupNums:
        Nums.loc[Nums["Manufacturer"] == num, "Level of Duplicacy"] = "High"
        Nums.loc[Nums["Manufacturer"] != num, "Level of Duplicacy"] = "Moderate"
    df.loc[Nums.index, :] = Nums
df.loc[pd.isnull(df["Material Part Number"]), "Level of Duplicacy"] = "Not Duplicate"
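A fully vectorized alternative, closer to the "groupby with duplicated" idea, is to test duplication over both columns at once and pick the label with np.select. This is a sketch; the sample frame below is made up to illustrate, not the asker's actual data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Material Part Number": ["A1", "A1", "A1", "B2", "C3", None],
    "Manufacturer":         ["X",  "X",  "Y",  "Z",  None, "X"],
})

# part number duplicated (nulls never count as duplicates)
part_dup = df["Material Part Number"].notna() & df.duplicated("Material Part Number", keep=False)
# both part number and manufacturer duplicated together
pair_dup = (part_dup & df["Manufacturer"].notna()
            & df.duplicated(["Material Part Number", "Manufacturer"], keep=False))

# first matching condition wins: exact pair match -> High, part-only match -> Moderate
df["Level of Duplicacy"] = np.select([pair_dup, part_dup],
                                     ["High", "Moderate"],
                                     default="Not duplicate")
```

On this sample the two A1/X rows come out "High", the A1/Y row "Moderate", and the rest "Not duplicate".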


What's the logic behind locating elements using letters in pandas?

I have a CSV file that I load into a pandas dataframe. I am practicing the loc method. The CSV file contains a list of James Bond movies, and I am passing letters to loc. I could not interpret the result shown.
bond = pd.read_csv("jamesbond.csv", index_col = "Film")
bond.sort_index(inplace = True)
bond.head(3)
bond.loc["A": "I"]
The result for the above code is:
bond.loc["a": "i"]
And the result for the above code is:
What is happening here? I could not understand. Could someone help me understand this behaviour of pandas?
Following is the file:
Your dataframe uses the first column ("Film") as an index when it is imported (because of the option index_col = "Film"). The column contains the name of each film stored as a string, and they all start with a capital letter. bond.loc["A":"I"] returns all films where the index is greater than or equal to "A" and less than or equal to "I" (pandas slices are upper-bound inclusive), which by the rules of string comparison in Python includes all films beginning with "A"-"H", and would also include a film called "I" if there was one. If you enter e.g. "A" <= "b" <="I" in the python prompt you will see that lower-case letters are not within the range, because ord("b") > ord("I").
If you wrote bond.index = bond.index.str.lower() that would change the index to lower case and you could search films using e.g. bond["a":"i"] (but bond["A":"I"] would no longer return any films).
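The comparison above can be checked with a small self-contained example (the titles and years here are made up for illustration, not the asker's actual file):

```python
import pandas as pd

bond = pd.DataFrame({"Year": [2006, 1964, 1973]},
                    index=["Casino Royale", "Goldfinger", "Live and Let Die"])
bond.sort_index(inplace=True)

# "Casino Royale" and "Goldfinger" sort between "A" and "I"; "Live and Let Die" does not
print(bond.loc["A":"I"])
# empty: every capitalized title sorts before "a", since ord("Z") < ord("a")
print(bond.loc["a":"i"])
```

Slicing with labels that are not in the index works here because the index is sorted; pandas finds the insertion points for the boundary strings.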
DataFrame.loc["A":"I"] returns the rows whose index labels fall within that range, from what I can see and tried to reproduce. Could you attach the data?

Need to find the second maximum number in a list. All the details are pasted below; please assist

Need to print second Maximum Number in a given List
Description - Given a list of numbers, find the second largest number in the list.
Note:- There might be repeated numbers in the list. If there is only one number present in the list, return 'not present'.
I have tried sorting the list directly, but I am not able to handle the 'not present' condition.
An easy implementation would be something like the following (using nums as the input list, since input shadows a builtin; the set drops repeated numbers, so len == 1 also covers a list of identical values):

values = set(nums)
if len(values) == 1:
    print('not present')
else:
    print(sorted(values)[-2])
Take a look at Get the second largest number in a list in linear time for other implementations.
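For reference, a single-pass O(n) version can be sketched like this (it treats repeats of the maximum as one value, matching the answer above):

```python
def second_largest(nums):
    """Return the second largest distinct value, or 'not present' if there is none."""
    first = second = float('-inf')
    for n in nums:
        if n > first:
            # n becomes the new maximum; the old maximum is now second
            first, second = n, first
        elif first > n > second:
            # n is strictly between the current top two
            second = n
    return second if second != float('-inf') else 'not present'
```

For example, second_largest([2, 5, 5, 3]) returns 3, and second_largest([7]) returns 'not present'.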

Finding rows with specific words in an excel sheet [closed]

I have a table similar to this format:
Table
My table has 50000 rows and 800 columns, certain cells (tab-separated) contain multiple comma-separated words (e.g L,N). I want to retain only rows that contain one of a set of words (say A to N) at a given column (say col2) and remove the remaining rows.
Is it possible to do using Vlookup or there is another way to do it? Any suggestion is appreciated.
Create a helper column, perhaps at the far right of a copy of your worksheet. Enter this formula in row 2 of that column, replacing C with the ID of the helper column (the one in which you write the formula) and B with the column in which to find the word. Copy the formula all the way down.
=ISERROR(FIND(C$1,$B2))
Now enter the word you want to keep in cell(1) of the helper column (C$1 in my example). The column will fill with TRUE and FALSE.
TRUE means that the word wasn't found and the row should be deleted
FALSE means that the word exists and should be kept
Now sort the sheet on that column and delete the block with TRUE in it. Fifteen seconds flat once you have done it a few times; that's faster than any VBA, R, or Python solution could be made to run.
The biggest issue I had was in conveying what I did, how, and why. No need to, so I deleted that part.
So, select any part, section, or range of your table and run the code.
You can remove the section of the code below 'Clear any existing filters' if you don't want to delete the found data and would rather keep it in your sheet, still tagged for additional purposes (for example, you could copy those rows to another file).
The code below should do what you asked in your question and leave you with just the parts of the table that DON'T contain any of your terms in your selection; i.e. it will delete rows and move the remaining ones up according to your criteria.
And yes, Python and R can do this for you too, with less code using dataframes, but this VBA code worked for my many examples. I didn't know how it would fare with 50000 rows and X columns, but it works fine.
Sub SearchTableorSelectionDeletetermsFound5()
    Dim corresspondingpartner() As Variant
    Dim rng As Range: Set rng = Selection
    Dim col As Range
    For Each col In rng.Columns
        Dim r As Range
        Dim rn As Variant
        Dim Rownum As Long
        Rownum = Selection.Rows.Count
        ReDim rm(0 To Rownum) As Variant 'array must be at least as large as the selection
        terms = Sheets("Sheet2").Cells(1, 1).CurrentRegion
        k = 1
        For rw = 0 To UBound(terms)
            ReDim Preserve corresspondingpartner(rw)
            corresspondingpartner(rw) = (k / k) 'gives each corresponding-partner element an id of 1
            k = k + 1
        Next
        For Each r In Selection
            n = 0
            m = n
            For Each c In terms
                'Check each term in turn against the cell. If one is found,
                'tag the row by writing a 1 into column 801 (ADU).
                If r.Offset(0, 0).Value Like "*" & c & "*" Then
                    rm(n) = corresspondingpartner(n) 'in the end this could simply have been = 1
                    Cells(r.Row, 801).Value = rm(n) / rm(n)
                End If
                n = n + 1
            Next
        Next
    Next col
    'Clear any existing filters
    On Error Resume Next
    ActiveSheet.ShowAllData
    On Error GoTo 0
    '1. Apply filter on the tag column
    ActiveSheet.Range("A1:ADU5000").AutoFilter Field:=801, Criteria1:=corresspondingpartner(n) / corresspondingpartner(n)
    '2. Delete rows
    Application.DisplayAlerts = False
    ActiveSheet.Range("A1:ADU5000").SpecialCells(xlCellTypeVisible).Delete
    Application.DisplayAlerts = True
    '3. Clear filter
    On Error Resume Next
    ActiveSheet.ShowAllData
    On Error GoTo 0
End Sub
You might be able to see that in the beginning I was working with offset columns printed from the table/selection, which took up unnecessary space, and was using a VBA Application.WorksheetFunction.HLookup call to produce a final result column tagging the rows to delete, but that turned out to be unnecessary. Those earlier versions/macros worked too, but were slower, so I did it without helper columns, using arrays.
I turned to my friend, [excel campus - delete rows based on conditions][1], to embed and tweak the autofilter code at the end which deletes the rows you don't want, so you don't have to do it yourself.
It's now a "virtual" HLookup on the array matches in your selection (or data): it deletes all rows that match your specifications, leaving you with the data you want.
I know it could be further improved and streamlined (starting with the way I produce the arrays), but I'm happy with its functionality and potential scope for now.
[1]: https://www.youtube.com/watch?v=UwNcSZrzd_w&t=6s

Selecting rows based on conditions (column) using the comparison operator >

While discovering Python, I found myself stuck trying to select rows (food items) based on the values of a column (macronutrients). My condition uses a relational operator and the output is not correct (particularly with the > or < operators; there is no problem with the == operator).
data.loc[data['protein']=='10']
The result looks correct: all the returned rows (food items) have a protein value of 10.
data.loc[data['protein']>'10']
The result is incorrect: the returned rows include values that do not respect the condition (rows with protein < 10 as well as rows with protein > 10).
Any thoughts on the issue? Do you think it's related to the file format (see the read call below)? If so, how can I get around the problem?
data = pd.read_excel('Documents/test.xlsx', names=col_names, usecols="D,E,F,G,H,J,M,N,P,Q,R,T,Y,Z,AA", index_col=[3])
Thanks in advance and Happy Holidays !!
[EDITED]
I did some more digging; indeed I am comparing two different things. @Daniel Mesejo: the dtype of protein is object. Since I want the protein column as float, I decided to convert it to string first and then to float. Unfortunately, converting it with .astype(str) didn't work.
Use data['protein'] = data['protein'].astype('int64') to convert the strings to integers, then retry what you were doing.
Your problem is that you are comparing strings rather than numbers. Change data.loc[data['protein']>'10'] to data.loc[data['protein'].astype(int) > 10] (calling int() on a whole Series raises an error; use .astype on the column).
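If the column may contain non-integer or unparsable values, pd.to_numeric is a more forgiving option than .astype. A minimal sketch with a made-up frame standing in for the asker's spreadsheet:

```python
import pandas as pd

# hypothetical data: protein values read in as strings
data = pd.DataFrame({"protein": ["8", "10", "12.5", "15"]})

# to_numeric handles floats too; errors="coerce" turns unparsable cells into NaN
data["protein"] = pd.to_numeric(data["protein"], errors="coerce")

high = data.loc[data["protein"] > 10]   # now a true numeric comparison
```

NaN rows are simply excluded by the comparison, so bad cells don't break the filter.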

How do you print out lines from a file that match a certain condition, but you have many columns to check?

name played wins loses
Leeroy,19,7,12
Jenkins,19,8,11
Tyler,19,0,19
Napoleon Wilson,19,7,12
Big Boss,19,7,12
Game Dude,19,5,14
Macho Man,19,3,16
Space Pirate,19,6,13
Billy Casper,19,7,12
Otacon,19,7,12
Big Brother,19,7,12
Ingsoc,19,5,14
Ripley,19,5,14
M'lady,19,4,15
Einstein100,19,8,11
Dennis,19,5,14
Esports,19,8,11
RNGesus,19,7,12
Kes,19,9,10
Magnitude,19,6,13
Basically, this is a file called firesideResults, which I open in my code and have to check through. If the wins column contains a 0 I do not print the row; if it contains a number other than zero I display it on the screen. However, I have multiple columns of numbers to deal with and I can't find how to check only one column.
My code was going to be:

if option == "C" or option == "c":
    answer = False
    file_3 = open("firesideResults.txt")
    for column in file_3:
        if "0" not in column:
            print(column)

But unfortunately, one of the other columns also contains a 0, so I cannot do that. Thank you for your help, and if possible please list any related questions I could check, as I have been searching for so long.
Since you have comma-separated fields, the best way would be to use csv!
import csv
with open("firesideResults.txt") as file_3:
    cr = csv.reader(file_3)
    for row in cr:
        if row[2] != "0":
            print(row)
If the third column of a row is not "0", the row is printed.
No substring issue: it checks the exact field.
It checks the field in the "wins" column only, not the other ones.
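To see why the exact-field check matters, here is a small self-contained sketch using two sample lines from the file (io.StringIO stands in for the open file):

```python
import csv
import io

# Tyler has 0 wins and should be dropped; Kes has 9 wins and 10 losses.
# The naive substring test '"0" not in line' would also discard Kes,
# because the loss count "10" contains the character "0".
sample = "Tyler,19,0,19\nKes,19,9,10\n"

kept = [row for row in csv.reader(io.StringIO(sample)) if row[2] != "0"]
print(kept)
```

Only the Kes row survives, even though its line contains a "0" character elsewhere.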
