Create a list with duplicates [closed] - python

I'm implementing a procedure, in my case "getDuplicatesAlphabetical", that takes a list of person objects and returns a tuple containing all the names that appear multiple times. I'm stuck at this point.
This is what I have so far:
def getDuplicatesAlphabetical(listOfPersonObjects):
    l = []
    dup = []
    for person in listOfPersonObjects:
        if person not in l:
            l.append(person)
        else:
            dup.append(person)
    return sorted(dup)
getDuplicatesAlphabetical(toObjectList(['Thomas', 'Michael', 'Thomas','Susanne','Michael','Thomas','Alfred','Alfred']))
# shall output: ('Alfred', 'Michael', 'Thomas')
I just do not understand what is missing.. Can somebody help me?
Regards, Mike

This is what a set is for! Checking if an item exists in a set is much cheaper O(1) than checking if it exists in a list O(n). The additional advantage is that adding an element that already exists in the set doesn't actually duplicate it.
def getDuplicatesAlphabetical(listOfPersonObjects):
    l = set()
    dup = set()
    for person in listOfPersonObjects:
        if person not in l:
            l.add(person)
        else:
            dup.add(person)
    return sorted(dup)
Testing this, we get
getDuplicatesAlphabetical(toObjectList(['Thomas', 'Michael', 'Thomas','Susanne','Michael','Thomas','Alfred','Alfred']))
# Output: ['Alfred', 'Michael', 'Thomas']
Another way is to count the instances of all names, and return the ones that have more than one occurrence. You can use the inbuilt collections.Counter for that, or do it yourself.
def getDuplicatesAlphabetical(listOfPersonObjects):
    counts = {}
    for person in listOfPersonObjects:
        countkey = person
        counts[countkey] = counts.get(countkey, 0) + 1
    return sorted([name for name, count in counts.items() if count > 1])
Remember that dict keys must be hashable (which in practice usually means immutable), so if listOfPersonObjects has unhashable elements such as dicts, you will have to do something like countkey = person.name or countkey = person['name'].

Your code doesn't check to see if the item is already in dup. An easy fix would be to add:
if person in dup:
    continue
or change the else to:
elif person not in dup:
    dup.append(person)
FWIW a much easier way to count up items (e.g. for finding duplicates) is collections.Counter:
>>> from collections import Counter
>>> def get_dupes_sorted(names):
...     return sorted(name for name, count in Counter(names).items() if count > 1)
...
>>> get_dupes_sorted(['Thomas', 'Michael', 'Thomas', 'Susanne', 'Michael', 'Thomas', 'Alfred', 'Alfred'])
['Alfred', 'Michael', 'Thomas']
Using Counter should also be faster for large data sets than using lists because it uses a dictionary internally; checking if an item is in a list requires scanning the entire list (so it gets slower as the list gets bigger, i.e. "linear time"), whereas locating an item in a dictionary or set takes the same amount of time (i.e. "constant time") regardless of how many other items there are.
I'm assuming that a personObject is either an alias for a string or is an object that correctly implements the various comparators that are needed for in and sorted to work -- if not, then that might be an additional problem! But it's impossible to debug that without seeing your implementation of toObjectList.
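As a rough illustration of what "correctly implements the various comparators" could mean, here is a minimal sketch. The Person class below is hypothetical (the question never shows its own class or toObjectList), and comparing by name only is an assumption of this example.

```python
from collections import Counter
from functools import total_ordering


@total_ordering
class Person:
    """Hypothetical stand-in for the question's person objects."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        # Two persons are considered equal when their names match.
        if not isinstance(other, Person):
            return NotImplemented
        return self.name == other.name

    def __hash__(self):
        # Must agree with __eq__ so that sets, dicts and Counter work.
        return hash(self.name)

    def __lt__(self, other):
        # Lets sorted() order persons alphabetically by name.
        if not isinstance(other, Person):
            return NotImplemented
        return self.name < other.name


def get_dupes_sorted(people):
    counts = Counter(people)
    return sorted(p for p, n in counts.items() if n > 1)


people = [Person(n) for n in ["Thomas", "Michael", "Thomas", "Susanne",
                              "Michael", "Thomas", "Alfred", "Alfred"]]
dupes = get_dupes_sorted(people)
print([p.name for p in dupes])  # ['Alfred', 'Michael', 'Thomas']
```

Without __eq__ and __hash__, two Person objects with the same name would count as different items, and without __lt__ the call to sorted would raise a TypeError.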

First, remove the toObjectList() call; that function is not defined in the code shown.
Then change the if statement as follows:
if listOfPersonObjects.count(person) > 1:
    if person in dup:
        l.append(person)
    else:
        dup.append(person)
Output: ['Alfred', 'Michael', 'Thomas']

Related

Comparing items through a tuple in Python

I am given an assignment where I am supposed to define a function that returns the second element of a tuple if the first element of the tuple matches the argument of the function.
Specifically, let's say that I have a list of student registration numbers that goes by:
particulars = (("S12345", "John"), ("S23456", "Max"), ("S34567", "Mary"))
And I have defined a function that is supposed to take in the argument of reg_num, such as "S12345", and return the name of the student in this case, "John". If the number does not match at all, I need to print "Not found" as a message. In essence, I understand that I need to sort through the larger tuple, and compare the first element [0] of each smaller tuple, then return the [1] entry of each smaller tuple. Here's what I have in mind:
def get_student_name(reg_num, particulars):
    for i in records:
        if reg_num == particulars[::1][0]:
            return particulars[i][1]
        else:
            print("Not found")
I know I'm wrong, but I can't tell why. I'm not well acquainted with how to sort through a tuple. Can anyone offer some advice, especially in syntax? Thank you very much!
When you write for i in particulars, in each iteration i is an item of the collection and not an index. As such, you cannot do particulars[i] (and there is no need, as you already have the item). In addition, remove the else clause so as not to print for every item that doesn't match the condition:
def get_student_name(reg_num, particulars):
    for i in particulars:
        if reg_num == i[0]:
            return i[1]
    print("Not found")
If you want to iterate using indices you could do the following (though it is less nice):
for i in range(len(particulars)):
    if reg_num == particulars[i][0]:
        return particulars[i][1]
Another approach, provided to help learn new tricks for manipulating python data structures:
You can turn your tuple of tuples:
particulars = (("S12345", "John"), ("S23456", "Max"), ("S34567", "Mary"))
into a dictionary:
>>> pdict = dict(particulars)
>>> pdict
{'S12345': 'John', 'S23456': 'Max', 'S34567': 'Mary'}
You can look up the value by supplying the key:
>>> r = 'S23456'
>>> pdict[r]
'Max'
The function:
def get_student_name(reg, s_data):
    try:
        return dict(s_data)[reg]
    except Exception:
        return "Not Found"
The use of try ... except will catch errors and just return Not Found in the case where the reg is not in the tuple in the first place. It will also catch the case where the supplied tuple is not a series of pairs, and thus cannot be converted the way you expect.
You can read more about exceptions: the basics and the docs to learn how to respond differently to different types of error.
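To make "respond differently to different types of error" concrete, here is one possible sketch. It assumes you want to tell malformed input apart from a missing registration number; the "Bad data" message is an invention of this example, not part of the original answer.

```python
def get_student_name(reg, s_data):
    try:
        lookup = dict(s_data)
    except (TypeError, ValueError):
        # s_data was not an iterable of (key, value) pairs.
        return "Bad data"
    try:
        return lookup[reg]
    except KeyError:
        # The data was fine, but this registration number is not in it.
        return "Not Found"


particulars = (("S12345", "John"), ("S23456", "Max"), ("S34567", "Mary"))
print(get_student_name("S23456", particulars))  # Max
print(get_student_name("S00000", particulars))  # Not Found
print(get_student_name("S12345", (1, 2, 3)))    # Bad data
```

Catching only the exceptions you expect means that unrelated bugs still surface as tracebacks instead of being silently reported as "Not Found".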
for loops in python
Gilad Green already answered your question with a way to fix your code and a quick explanation on for loops.
Here are five loops that do more or less the same thing; I invite you to try them out.
particulars = (("S12345", "John"), ("S23456", "Max"), ("S34567", "Mary"))
for t in particulars:
    print("{} {}".format(t[0], t[1]))
for i in range(len(particulars)):
    print("{}: {} {}".format(i, particulars[i][0], particulars[i][1]))
for i, t in enumerate(particulars):
    print("{}: {} {}".format(i, t[0], t[1]))
for reg_value, student_name in particulars:
    print("{} {}".format(reg_value, student_name))
for i, (reg_value, student_name) in enumerate(particulars):
    print("{}: {} {}".format(i, reg_value, student_name))
Using dictionaries instead of lists
Most importantly, I would like to add that using an unsorted list to store your student records is not the most efficient way.
If you sort the list and maintain it in sorted order, then you can use binary search to search for reg_num much faster than browsing the list one item at a time. Think of this: when you need to look up a word in a dictionary, do you read all words one by one, starting by "aah", "aback", "abaft", "abandon", etc.? No; first, you open the dictionary somewhere in the middle; you compare the words on that page with your word; then you open it again to another page; compare again; every time you do that, the number of candidate pages diminishes greatly, and so you can find your word among 300,000 other words in a very small time.
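The binary search idea can be sketched with the standard bisect module. This is only an illustration: the names records, reg_numbers and find_student are made up for this example, and it assumes the records are kept sorted by registration number.

```python
from bisect import bisect_left

# Records kept sorted by registration number (an assumption of this sketch).
records = sorted([("S23456", "Max"), ("S12345", "John"), ("S34567", "Mary")])
reg_numbers = [reg for reg, _ in records]


def find_student(reg_num):
    # bisect_left returns the position where reg_num would be inserted,
    # which is also where it sits if it is already present.
    i = bisect_left(reg_numbers, reg_num)
    if i < len(reg_numbers) and reg_numbers[i] == reg_num:
        return records[i][1]
    return None


print(find_student("S23456"))  # Max
print(find_student("S99999"))  # None
```

Each lookup halves the remaining candidates, so it needs about log2(n) comparisons instead of up to n.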
Instead of using a sorted list with binary search, you could use another data structure, for instance a binary search tree or a hash table.
But, wait! Python already does that very easily!
There is a data structure in python called a dictionary. See the documentation on dictionaries. This structure is perfectly adapted to most situations where you have keys associated to values. Here the key is the reg_number, and the value is the student name.
You can define a dictionary directly:
particulars = {'S12345': 'John', 'S23456': 'Max', 'S34567': 'Mary'}
Or you can convert your list of tuples to a dictionary:
particulars = (("S12345", "John"), ("S23456", "Max"), ("S34567", "Mary"))
particulars_as_dict = dict(particulars)
Then you can check if a reg_number is in the dictionary with the keyword in; you can return the student name using square brackets or with the method get:
>>> particulars = {'S12345': 'John', 'S23456': 'Max', 'S34567': 'Mary'}
>>> 'S23456' in particulars
True
>>> 'S98765' in particulars
False
>>>
>>> particulars['S23456']
'Max'
>>> particulars.get('S23456')
'Max'
>>> particulars.get('S23456', 'not found')
'Max'
>>>
>>> particulars['S98765']
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError: 'S98765'
>>> print(particulars.get('S98765'))
None
>>> particulars.get('S98765', 'not found')
'not found'

Simplifying a list into categories

I am a new Python developer and was wondering if someone can help me with this. I have a dataset that has one column that describes a company type. I noticed that the column has, for example, both surgical and surgery listed. It has eyewear, eyeglasses and optometry listed. So instead of having a huge list in this column, I want to simplify the category: if a value contains "eye", "glasses" or "opto", then just change it to "eyewear". My initial code looks like this:
def map_company(row):
    company = row['SIC_Desc']
    if company in 'Surgical':
        return 'Surgical'
    elif company in ['Eye', 'glasses', 'opthal', 'spectacles', 'optometers']:
        return 'Eyewear'
    elif company in ['Cotton', 'Bandages', 'gauze', 'tape']:
        return 'First Aid'
    elif company in ['Dental', 'Denture']:
        return 'Dental'
    elif company in ['Wheelchairs', 'Walkers', 'braces', 'crutches', 'ortho']:
        return 'Mobility equipments'
    else:
        return 'Other'
df['SIC_Desc'] = df.apply(map_company,axis=1)
This is not correct though because it is changing every item into "Other," so clearly my syntax is wrong. Can someone please help me simplify this column that I am trying to relabel?
Thank you
It is hard to answer without having the exact content of your data set, but I can see one mistake. According to your description, it seems you are looking at this the wrong way. You want one of the words to be in your company description, so it should look like that:
if any(test in company for test in ['Eye', 'glasses', 'opthal', 'spectacles', 'optometers']):
However, you might have a case issue here, so I would recommend:
company = row['SIC_Desc'].lower()
if any(test.lower() in company for test in ['Eye', 'glasses', 'opthal', 'spectacles', 'optometers']):
    return 'Eyewear'
You will also need to make sure company is a string and 'SIC_Desc' is a correct column name.
In the end your function will look like this:
def is_match(company,names):
    return any(name in company for name in names)
def map_company(row):
    company = row['SIC_Desc'].lower()
    if 'surgical' in company:
        return 'Surgical'
    elif is_match(company,['eye','glasses','opthal','spectacles','optometers']):
        return 'Eyewear'
    elif is_match(company,['cotton', 'bandages', 'gauze', 'tape']):
        return 'First Aid'
    else:
        return 'Other'
Here is an option using a reversed dictionary.
Code
import pandas as pd
# Sample DataFrame
s = pd.Series(["gauze", "opthal", "tape", "surgical", "eye", "spectacles",
               "glasses", "optometers", "bandages", "cotton", "glue"])
df = pd.DataFrame({"SIC_Desc": s})
df
LOOKUP = {
    "Eyewear": ["eye", "glasses", "opthal", "spectacles", "optometers"],
    "First Aid": ["cotton", "bandages", "gauze", "tape"],
    "Surgical": ["surgical"],
    "Dental": ["dental", "denture"],
    "Mobility": ["wheelchairs", "walkers", "braces", "crutches", "ortho"],
}
REVERSE_LOOKUP = {v: k for k, lst in LOOKUP.items() for v in lst}
def map_company(row):
    company = row["SIC_Desc"].lower()
    return REVERSE_LOOKUP.get(company, "Other")
df["SIC_Desc"] = df.apply(map_company, axis=1)
df
Details
We define a LOOKUP dictionary with (key, value) pairs of expected output and associated words, respectively. Note, the values are lowercase to simplify searching. Then we use a reversed dictionary to automatically invert the key value pairs and improve the search performance, e.g.:
>>> REVERSE_LOOKUP
{'bandages': 'First Aid',
'cotton': 'First Aid',
'eye': 'Eyewear',
'gauze': 'First Aid',
...}
Notice these reference dictionaries are created outside the mapping function to avoid rebuilding dictionaries for every call to map_company(). Finally the mapping function quickly returns the desired output using the reversed dictionary by calling .get(), a method that returns the default argument "Other" if no entry is found.
See @Flynsee's insightful answer for an explanation of what is happening in your code. The code here is cleaner compared to a bevy of conditional statements.
Benefits
Since we have used dictionaries, the search time should be relatively fast, O(1) compared to a O(n) complexity using in. Moreover, the main LOOKUP dictionary is adaptable and liberated from manually implementing extensive conditional statements for new entries.
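One caveat: a dictionary .get() only matches when the whole cell equals a keyword exactly. If the descriptions are longer phrases, as the question's "contains" wording suggests, a substring check is still needed. The sketch below is one way to keep the table-driven style while matching substrings; the function name categorize and the sample phrases are assumptions of this example, and the keywords are taken from the question's description.

```python
LOOKUP = {
    "Surgical": ["surgical", "surgery"],
    "Eyewear": ["eye", "glasses", "opto"],
    "First Aid": ["cotton", "bandages", "gauze", "tape"],
}


def categorize(description, lookup=LOOKUP, default="Other"):
    text = description.lower()
    # Categories are tried in insertion order; the first keyword hit wins.
    for category, keywords in lookup.items():
        if any(keyword in text for keyword in keywords):
            return category
    return default


print(categorize("Eyeglasses and frames"))  # Eyewear
print(categorize("Optometry services"))     # Eyewear
print(categorize("Glue"))                   # Other
```

This trades the O(1) lookup for a scan over all keywords per row, which is usually acceptable for a handful of categories.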

Having problems in extracting duplicates

I am stumped with this problem, and no matter how I get around it, it is still giving me the same result.
Basically, supposedly I have 2 groups - GrpA_null and GrpB_null, each having 2 meshes in them and are named exactly the same, brick_geo and bars_geo
- Result: GrpA_null --> brick_geo, bars_geo
But for some reason, in the code below which I presume is the one giving me problems, when it is run, the program states that GrpA_null has the same duplicates as GrpB_null, probably they are referencing the brick_geo and bars_geo. As soon as the code is run, my children geo have a numerical value behind,
- Result: GrpA_null --> brick_geo0, bars_geo0, GrpB_null1 --> brick_geo, bars_geo1
And so, I tried to modify the code so that, as long as the parent (GrpA_null and GrpB_null) is different, it does not 'touch' the children.
Could someone kindly advise me on it?
def extractDuplicateBoxList(self, inputs):
    result = {}
    for i in range(0, len(inputs)):
        print '<<< i is : %s' %i
        for n in range(0, len(inputs)):
            print '<<< n is %s' %n
            if i != n:
                name = inputs[i].getShortName()
                # Result: brick_geo
                Lname = inputs[i].getLongName()
                # Result: |GrpA_null|concrete_geo
                if name == inputs[n].getShortName():
                    # If list already created as result.
                    if result.has_key(name):
                        # Make sure its not already in the list and add it.
                        alreadyAdded = False
                        for box in result[name]:
                            if box == inputs[i]:
                                alreadyAdded = True
                        if alreadyAdded == False:
                            result[name].append(inputs[i])
                    # Otherwise create a new list and add it.
                    else:
                        result[name] = []
                        result[name].append(inputs[i])
    return result
There are a couple of things you may want to be aware of. First and foremost, indentation matters in Python. I don't know if the indentation of your code as is is as intended, but your function code should be indented further in than your function def.
Secondly, I find your question a little difficult to understand. But there are several things which would improve your code.
In the collections module, there is (or should be) a type called defaultdict. This type is similar to a dict, except for it having a default value of the type you specify. So a defaultdict(int) will have a default of 0 when you get a key, even if the key wasn't there before. This allows the implementation of counters, such as to find duplicates without sorting.
from collections import defaultdict
counter = defaultdict(int)
for item in items:
    counter[item] += 1
This brings me to another point. Python for loops implement a for-each structure. You almost never need to enumerate your items in order to then access them. So, instead of
for i in range(0,len(inputs)):
you want to use
for input in inputs:
and if you really need to enumerate your inputs
for i,input in enumerate(inputs):
Finally, you can iterate and filter through iterable objects using list comprehensions, dict comprehensions, or generator expressions. They are very powerful. See Create a dictionary with list comprehension in Python
Try this code out, play with it. See if it works for you.
from collections import defaultdict
def extractDuplicateBoxList(self, inputs):
    counts = defaultdict(int)
    for input in inputs:
        counts[input.getShortName()] += 1
    dup_shns = set([k for k,v in counts.items() if v > 1])
    # test each element i, not the leftover loop variable input
    dups = [i for i in inputs if i.getShortName() in dup_shns]
    return dups
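If you would rather get the duplicates grouped by short name, as the original function did, a defaultdict(list) can do the grouping in one pass. The following is only a sketch: FakeNode is a hypothetical stand-in for the real scene objects, assuming they expose getShortName() and getLongName() as in the question.

```python
from collections import defaultdict


class FakeNode:
    """Hypothetical stand-in for the scene objects in the question."""

    def __init__(self, long_name):
        self.long_name = long_name

    def getShortName(self):
        # Last path component, e.g. "brick_geo".
        return self.long_name.split("|")[-1]

    def getLongName(self):
        return self.long_name


def extract_duplicates(inputs):
    groups = defaultdict(list)
    for node in inputs:
        groups[node.getShortName()].append(node)
    # Keep only the short names shared by more than one node.
    return {name: nodes for name, nodes in groups.items() if len(nodes) > 1}


nodes = [FakeNode("|GrpA_null|brick_geo"), FakeNode("|GrpA_null|bars_geo"),
         FakeNode("|GrpB_null|brick_geo"), FakeNode("|GrpB_null|bars_geo"),
         FakeNode("|GrpB_null|door_geo")]
dups = extract_duplicates(nodes)
print(sorted(dups))  # ['bars_geo', 'brick_geo']
```

This replaces the nested loops, which compare every pair of inputs, with a single pass over the inputs.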
I was about to write the same remarks as bitsplit, but he has already done so.
So for the moment I will just give you code that I think does exactly the same as yours, based on these remarks and on the dictionary's get method:
from collections import defaultdict
def extract_Duplicate_BoxList(self, inputs):
    result = defaultdict(list)
    for i,A in enumerate(inputs):
        print '<<< i is : %s' %i
        name = A.getShortName() # Result: brick_geo
        Lname = A.getLongName() # Result: |GrpA_null|concrete_geo
        for n in (j for j,B in enumerate(inputs)
                  if j!=i and B.getShortName()==name):
            print '<<< n is %s' %n
            if A not in result.get(name,[]):
                result[name].append(A)
    return result
Secondly, as bitsplit said, I find your question hard to understand.
Could you give more information on the elements of inputs?
Your explanations about GrpA_null and GrpB_null and the names and the meshes are unclear.
EDIT:
If my reduction/simplification is correct, then on examining it I see that what you essentially do is compare elements A and B of inputs (with A != B) and record A in the dictionary result under the key shortname (only once) if A and B have the same shortname.
I think this code can still be reduced to just:
def extract_Duplicate_BoxList(inputs):
    result = defaultdict(list)
    for i,A in enumerate(inputs):
        print '<<< i is : %s' %i
        result[A.getShortName()].append(A)
    return result
This may do what you're looking for, if I understand it correctly: it seems you want to compare the sub-hierarchies of different nodes to see whether they have the same names.
import maya.cmds as cmds
def child_nodes(node):
    ''' returns a set with the relative paths of all <node>'s children'''
    root = cmds.ls(node, l=True)[0]
    children = cmds.listRelatives(node, ad=True, f=True)
    return set( [k[len(root):] for k in children])
child_nodes('group1')
# Result: set([u'|pCube1|pCubeShape1', u'|pSphere1', u'|pSphere1|pSphereShape1', u'|pCube1']) #
# note the returns are NOT valid maya paths, since i've removed the root <node>,
# you'd need to add it back in to actually access a real shape here:
all_kids = child_nodes('group1')
real_children = ['group1' + n for n in all_kids ]
Since the returns are sets, you can test to see if they are equal, see if one is a subset or superset of the other, see what they have in common and so on:
# compare children
child_nodes('group1') == child_nodes('group2')
# one is a superset of the other:
child_nodes('group1').issuperset(child_nodes('group2'))
Iterating over a bunch of nodes is easy:
# collect all the child sets of a bunch of nodes:
kids = dict ( (k, child_nodes(k)) for k in cmds.ls(*nodes))

How to compare an element of a tuple (int) to determine if it exists in a list

I have the two following lists:
# List of tuples representing the index of resources and their unique properties
# Format of (ID,Name,Prefix)
resource_types=[('0','Group','0'),('1','User','1'),('2','Filter','2'),('3','Agent','3'),('4','Asset','4'),('5','Rule','5'),('6','KBase','6'),('7','Case','7'),('8','Note','8'),('9','Report','9'),('10','ArchivedReport',':'),('11','Scheduled Task',';'),('12','Profile','<'),('13','User Shared Accessible Group','='),('14','User Accessible Group','>'),('15','Database Table Schema','?'),('16','Unassigned Resources Group','#'),('17','File','A'),('18','Snapshot','B'),('19','Data Monitor','C'),('20','Viewer Configuration','D'),('21','Instrument','E'),('22','Dashboard','F'),('23','Destination','G'),('24','Active List','H'),('25','Virtual Root','I'),('26','Vulnerability','J'),('27','Search Group','K'),('28','Pattern','L'),('29','Zone','M'),('30','Asset Range','N'),('31','Asset Category','O'),('32','Partition','P'),('33','Active Channel','Q'),('34','Stage','R'),('35','Customer','S'),('36','Field','T'),('37','Field Set','U'),('38','Scanned Report','V'),('39','Location','W'),('40','Network','X'),('41','Focused Report','Y'),('42','Escalation Level','Z'),('43','Query','['),('44','Report Template ','\\'),('45','Session List',']'),('46','Trend','^'),('47','Package','_'),('48','RESERVED','`'),('49','PROJECT_TEMPLATE','a'),('50','Attachments','b'),('51','Query Viewer','c'),('52','Use Case','d'),('53','Integration Configuration','e'),('54','Integration Command','f'),('55','Integration Target','g'),('56','Actor','h'),('57','Category Model','i'),('58','Permission','j')]
# This is a list of resource ID's that we do not want to reference directly, ever.
unwanted_resource_types=[0,1,3,10,11,12,13,14,15,16,18,20,21,23,25,27,28,32,35,38,41,47,48,49,50,57,58]
I'm attempting to compare the two in order to build a third list containing the 'Name' of each unique resource type that currently exists in unwanted_resource_types. e.g. The final result list should be:
result = ['Group','User','Agent','ArchivedReport','ScheduledTask','...','...']
I've tried the following that (I thought) should work:
result = []
for res in resource_types:
    if res[0] in unwanted_resource_types:
        result.append(res[1])
and when that failed to populate result I also tried:
result = []
for res in resource_types:
    for type in unwanted_resource_types:
        if res[0] == type:
            result.append(res[1])
also to no avail. Is there something I'm missing? I believe this would be the right place for a list comprehension, but that's still in my grey basket of things I don't fully understand (the Python docs are a bit too succinct for me in this case).
I'm also open to completely rethinking this problem, but I do need to retain the list of tuples as it's used elsewhere in the script. Thank you for any assistance you may provide.
Your resource types are using strings, and your unwanted resources are using ints, so you'll need to do some conversion to make it work.
Try this:
result = []
for res in resource_types:
    if int(res[0]) in unwanted_resource_types:
        result.append(res[1])
or using a list comprehension:
result = [item[1] for item in resource_types if int(item[0]) in unwanted_resource_types]
The numbers in resource_types are numbers contained within strings, whereas the numbers in unwanted_resource_types are plain numbers, so your comparison is failing. This should work:
result = []
for res in resource_types:
    if int( res[0] ) in unwanted_resource_types:
        result.append(res[1])
The problem is that your triples contain strings and your unwanted resources contain numbers. Either change the data to
resource_types=[(0,'Group','0'), ...
or use int() to convert the strings to ints before comparison, and it should work. Your result can be computed with a list comprehension as in
result=[rt[1] for rt in resource_types if int(rt[0]) in unwanted_resource_types]
If you change ('0', ...) into (0, ... you can leave out the int() call.
Additionally, you may change the unwanted_resource_types variable into a set, like
unwanted_resource_types=set([0,1,3, ... ])
to improve speed (if speed is an issue, else it's unimportant).
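Putting the int() conversion and the set together might look like the sketch below. It uses a shortened copy of the question's data, and the names unwanted_ids and unwanted_names are chosen for this example only.

```python
# Shortened sample of the question's (ID, Name, Prefix) tuples.
resource_types = [("0", "Group", "0"), ("1", "User", "1"),
                  ("2", "Filter", "2"), ("3", "Agent", "3")]

# A set gives constant-time membership tests.
unwanted_ids = {0, 1, 3}

# Convert the string ID to int before testing membership.
unwanted_names = [name for res_id, name, _prefix in resource_types
                  if int(res_id) in unwanted_ids]

print(unwanted_names)  # ['Group', 'User', 'Agent']
```

Unpacking each tuple in the comprehension avoids the res[0] and res[1] indexing and keeps the original list of tuples untouched.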
The one-liner:
result = map(lambda x: dict(map(lambda a: (int(a[0]), a[1]), resource_types))[x], unwanted_resource_types)
without any explicit loop does the job.
Ok - you don't want to use this in production code - but it's fun. ;-)
Comment:
The inner dict(map(lambda a: (int(a[0]), a[1]), resource_types)) creates a dictionary from the input data:
{0: 'Group', 1: 'User', 2: 'Filter', 3: 'Agent', ...
The outer map chooses the names from the dictionary.

Remove Duplicate Items in Dictionary

I'm trying to remove duplicate items in a list through a dictionary:
def RemoveDuplicates(list):
    d = dict()
    for i in xrange(0, len(list)):
        dict[list[i]] = 1  # <------- error here
    return d.keys()
But it raises the following error:
TypeError: 'type' object does not support item assignment
What is the problem?
You should have written:
d[list[i]] = 1
But why not do this?
def RemoveDuplicates(l):
    return list(set(l))
Also, don't use built-in function names as variable names. It can lead to confusing bugs.
In addition to what others have said, it is unpythonic to do this:
for i in xrange(0, len(lst)):
    do stuff with lst[i]
when you can do this instead:
for item in lst:
    do stuff with item
dict is the type, you mean d[list[i]] = 1.
Addition: This points out the actual error in your code. But the answers provided by others provide better way to achieve what you are aiming at.
def remove_duplicates(myList):
    return list(set(myList))
From looking at your code it seems that you are not bothered about the ordering of elements and concerned only about the uniqueness. In such a case, a set() could be a better data structure.
The fix in your code is simply to use a function argument name that is not the name of the built-in type list, and to index the dictionary d rather than the type dict in the expression dict[list[i]].
Note that using list(set(seq)) will likely change the ordering of the remaining items. If retaining their order is important, you need to build a copy of the list:
items = set()
copy = []
for item in seq:
    if item not in items:
        copy.append(item)
        items.add(item)
seq = copy
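On Python 3.7 and later, where dicts are guaranteed to keep insertion order, the same order-preserving deduplication can be sketched with dict.fromkeys. The function name below is made up for this example, and the items must be hashable.

```python
def remove_duplicates_keep_order(seq):
    # dict keys are unique and keep first-insertion order (Python 3.7+).
    return list(dict.fromkeys(seq))


print(remove_duplicates_keep_order([3, 1, 3, 2, 1]))  # [3, 1, 2]
```

Note that this relies on Python 3 behaviour, whereas the question's code (xrange) targets Python 2, where the explicit loop is the safer choice.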
