I have a list that contains a lot of duplicated values. This is the format of the list:
https://imgur.com/a/tj2ZwxG
The fields are, in this order: "User_ID", "Movie_ID", "Rating", "Time".
For each "User_ID", I want to keep the first five occurrences and remove every later occurrence until a different "User_ID" appears. For example:
Suppose I have a list containing only "User_ID" values (from 1 to 196), like this:
1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2...
Here there are six occurrences of 1 and seven occurrences of 2.
So for 1, I remove everything after the fifth occurrence until I reach the first 2. The same goes for 2: I start removing after its fifth occurrence until I reach a new number, which will be 3, and so on.
The result is a new list like this: 1, 1, 1, 1, 1, 2, 2, 2, 2, 2
containing only 5 instances of each distinct element.
I know I can access the "User_ID" field like this: list[index]["User_ID"]
Is there a function that does this? If there isn't, could someone help me create one?
Thanks for the help!
What I was trying to do was something like this:
a = 0
b = 1
start = 0
position = 0
while(something that I don't know):
    while(list[a]['User_ID'] == list[b]['User_ID']):  # iterate through the list; I only advance while the current and next elements are the same
        a+=1
        b+=1
        position+=1
    if(list[a]['User_ID'] != list[b]['User_ID']):  # when I finally find a different element
        del new_list[start:start+position]  # delete from the start position, which is five, until the position before the different element
        a+=1
        b+=1
        start+=5
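# named lst so the built-in list type is not shadowed
lst = [1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3]
unique = set(lst)
for x in unique:
    y = lst.count(x)
    # remove() drops the first matching element each time,
    # so this loop stops once only five copies of x remain
    while y > 5:
        lst.remove(x)
        y -= 1
print(lst)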
Your input seems to be a list of dict instances. You can use various itertools to only keep 5 dicts with same User_ID key in a space and time efficient manner:
from itertools import chain, groupby, islice
from operator import itemgetter
lst = [{'User_ID': 1, ...}, {'User_ID': 1, ...}, ..., {'User_ID': 2, ...}, ...]
key = itemgetter('User_ID')
only5 = list(chain.from_iterable(islice(g, 5) for _, g in groupby(lst, key=key)))
This groups the list into chunks with the same User_ID and then takes the first 5 from each chunk into the new list.
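To make the grouping step concrete, here is a small runnable sketch of the same idea. The helper name keep_first_n_per_user and the sample rows are made up for illustration (the elided fields from the snippet above are not filled in), and it assumes rows with the same User_ID are already adjacent, as in the question.

```python
from itertools import chain, groupby, islice
from operator import itemgetter


def keep_first_n_per_user(rows, n=5, field='User_ID'):
    """Keep at most n rows from each consecutive run of equal `field` values."""
    key = itemgetter(field)
    # groupby yields one group per run; islice takes the first n of each run
    return list(chain.from_iterable(islice(g, n) for _, g in groupby(rows, key=key)))


# Hypothetical sample data: six rows for user 1, seven rows for user 2.
rows = [{'User_ID': 1, 'Movie_ID': m} for m in range(6)]
rows += [{'User_ID': 2, 'Movie_ID': m} for m in range(7)]

trimmed = keep_first_n_per_user(rows)
user_ids = [row['User_ID'] for row in trimmed]
print(user_ids)
```

Because groupby only merges adjacent items, an unsorted list would need to be sorted by User_ID first.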
Your example list [1, 1, 1, 1, 1, ...] confused me; it looks like you actually have a list of dicts or objects.
If you care about every field, you can probably just turn it into a set and then back into a list (this needs hashable elements, so it will not work on plain dicts):
my_list = list(set(my_list))
If they are objects, you can override __eq__(self, other) and __hash__(self), and I think you will then be able to use the same list/set/list transform to remove duplicates.
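As a rough sketch of what that could look like, assume a hypothetical Rating class whose attribute names mirror the fields in the question. Two instances compare equal when all four fields match, so the set keeps one of each.

```python
class Rating(object):
    """Hypothetical record type; attribute names are assumptions based on the question."""

    def __init__(self, user_id, movie_id, rating, time):
        self.user_id = user_id
        self.movie_id = movie_id
        self.rating = rating
        self.time = time

    def _key(self):
        # the tuple of fields that defines equality
        return (self.user_id, self.movie_id, self.rating, self.time)

    def __eq__(self, other):
        return isinstance(other, Rating) and self._key() == other._key()

    def __hash__(self):
        # equal objects must have equal hashes
        return hash(self._key())


ratings = [Rating(1, 10, 5, 100), Rating(1, 10, 5, 100), Rating(1, 11, 4, 101)]
unique_ratings = list(set(ratings))
```

Note that this only removes exact duplicates and does not preserve the original order; it does not limit each User_ID to five entries.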
I am writing a data-structures program in Python 2.7.13 and I am having trouble finishing it. I want to remove every value that appears in both lists, while keeping the repeated values of either list as long as they do not appear in the other list.
To make it clear, here is an example.
Suppose:
input:
a= [1,2,2,5,6,6]
b= [2,5,7,9]
expected output:
c= [7,9,1,6,6]
You can start with
>>> set(a).intersection(set(b))
{2, 5}
>>> set(a).union(set(b)) - set(a).intersection(set(b))
{1, 6, 7, 9}
Let us say you store those values in a list called common:
common = list(set(a).union(set(b)) - set(a).intersection(set(b)))
Then you can find your list as:
>>> [c for c in (a+b) if c in common]
[1, 6, 6, 7, 9]
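The union-minus-intersection expression is the symmetric difference, which sets also expose through the ^ operator. The sketch below uses it and walks b + a instead of a + b, on the assumption that the order in the expected output (values from b first, then values from a) matters.

```python
a = [1, 2, 2, 5, 6, 6]
b = [2, 5, 7, 9]

# values that occur in exactly one of the two lists
only_in_one = set(a) ^ set(b)

# walk b first, then a, keeping repeats of any value that survived
c = [x for x in b + a if x in only_in_one]
print(c)
```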
I have multiple lists, each containing words and a number representing how many times the word showed up in an article. I want to combine these lists, keeping unique words separate and adding up the counts of words that appear in more than one list. Example:
list_one = [(u'he':3),(u'she':2),(u'it':1),(u'pineapple':1)]
list_two = [(u'he':4),(u'she':1),(u'it':0)]
Combining list_one and list_two should then return a list_three:
list_three = [(u'he':7),(u'she':3),(u'it':1),(u'pineapple':1)]
I got the lists from articles using collections.Counter and have tried using Counter.update to add the two together. I would like to keep the order, meaning the words with the highest counts stay at the front of the list. Any help would be great.
Python Counters can actually be summed! - http://ideone.com/spJMsx
Several mathematical operations are provided for combining Counter objects to produce multisets (counters that have counts greater than zero). Addition and subtraction combine counters by adding or subtracting the counts of corresponding elements.
From the Python documentation
So this:
from collections import Counter
list1 = Counter(['eggs','spam','spam','eggs','sausage','and spam'])
list2 = Counter(['spam','bacon','spam','eggs','sausage','and spam'])
print list1
print list2
print list1+list2
Outputs this:
Counter({'eggs': 2, 'spam': 2, 'sausage': 1, 'and spam': 1})
Counter({'spam': 2, 'eggs': 1, 'bacon': 1, 'sausage': 1, 'and spam': 1})
Counter({'spam': 4, 'eggs': 3, 'sausage': 2, 'and spam': 2, 'bacon': 1})
Let's start with your two lists, adapted slightly to work in Python:
list_one = [(u'he', 3),(u'she', 2),(u'it', 1),(u'pineapple', 1)]
list_two = [(u'he', 4),(u'she', 1),(u'it',0)]
Now, let's combine them:
d = {word:value for word, value in list_one}
for word, value in list_two:
d[word] = d.get(word, 0) + value
print(d)
This produces the desired numbers in dictionary form:
{u'it': 1, u'pineapple': 1, u'she': 3, u'he': 7}
The above is a dictionary. If you wanted it back in list of tuple form, just use list(d.items()):
[(u'it', 1), (u'pineapple', 1), (u'she', 3), (u'he', 7)]
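The question also asks for the highest counts to come first, which a plain dictionary does not guarantee. One possible sketch, assuming the pairs are converted back to Counter objects, uses Counter addition and most_common():

```python
from collections import Counter

list_one = [(u'he', 3), (u'she', 2), (u'it', 1), (u'pineapple', 1)]
list_two = [(u'he', 4), (u'she', 1), (u'it', 0)]

# dict() turns the (word, count) pairs into a mapping that Counter accepts
total = Counter(dict(list_one)) + Counter(dict(list_two))

# most_common() returns (word, count) pairs sorted by count, highest first
list_three = total.most_common()
print(list_three)
```

Be aware that Counter addition drops any word whose summed count is zero or negative, and that the relative order of words with equal counts is not something to rely on.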
I need to compare two lists of numbers and count how many elements of the first list are also in the second list. For example,
a = [2, 3, 3, 4, 4, 5]
b1 = [0, 2, 2, 3, 3, 4, 6, 8]
Here I should get a result of 4: I should count '2' once (as it occurs only once in the first list), '3' twice, and '4' once (as it occurs only once in the second list). I was using the following code:
def scoreIn(list1, list2):
    score=0
    list2c=list(list2)
    for i in list1:
        if i in list2c:
            score+=1
            list2c.remove(i)
    return score
It works correctly, but it is too slow for my case (I call it 15000 times). I read a hint about 'walking' through sorted lists, which was supposed to be faster, so I tried this:
def scoreWalk(list1, list2):
    score=0
    i=0
    j=0
    len1=len(list1) # we assume that list2 is never shorter than list1
    while i<len1:
        if list1[i]==list2[j]:
            score+=1
            i+=1
            j+=1
        elif list1[i]>list2[j]:
            j+=1
        else:
            i+=1
    return score
Unfortunately this code is even slower. Is there any way to make it more efficient? In my case, both lists are sorted and contain only integers, and list1 is never longer than list2.
You can use the intersection feature of collections.Counter to solve the problem in an easy and readable way:
>>> from collections import Counter
>>> intersection = Counter( [2,3,3,4,4,5] ) & Counter( [0, 2, 2, 3, 3, 4, 6, 8] )
>>> intersection
Counter({3: 2, 2: 1, 4: 1})
As @Bakuriu says in the comments, to obtain the number of elements in the intersection (including duplicates), like your scoreIn function, you can then use sum( intersection.values() ).
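Put together as a function, that could look like the sketch below. The name score_counter is made up here to mirror scoreIn from the question.

```python
from collections import Counter


def score_counter(list1, list2):
    # & keeps, for each value, the smaller of its two counts
    intersection = Counter(list1) & Counter(list2)
    return sum(intersection.values())


score = score_counter([2, 3, 3, 4, 4, 5], [0, 2, 2, 3, 3, 4, 6, 8])
print(score)
```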
However, doing it this way you're not actually taking advantage of the fact that your data is pre-sorted, nor of the fact (mentioned in the comments) that you're doing this over and over again with the same list.
Here is a more elaborate solution more specifically tailored for your problem. It uses a Counter for the static list and directly uses the sorted dynamic list. On my machine it runs in 43% of the run-time of the naïve Counter approach on randomly generated test data.
def common_elements( static_counter, dynamic_sorted_list ):
    last = None # previous element in the dynamic list
    count = 0 # count seen so far for this element in the dynamic list
    total_count = 0 # total common elements seen, eventually the return value
    for x in dynamic_sorted_list:
        # since the list is sorted, if there's more than one element they
        # will be consecutive.
        if x == last:
            # one more of the same as the previous element
            # all we need to do is increase the count
            count += 1
        else:
            # this is a new element that we haven't seen before.
            # first "flush out" the current count we've been keeping.
            # - count is the number of times it occurred in the dynamic list
            # - static_counter[ last ] is the number of times it occurred in
            #   the static list (the Counter class counted this for us)
            # thus the number of occurrences the two have in common is the
            # smaller of these numbers. (Note that unlike a normal dictionary,
            # which would raise KeyError, a Counter will return zero if we try
            # to look up a key that isn't there at all.)
            total_count += min( static_counter[ last ], count )
            # now set count and last to the new element, starting a new run
            count = 1
            last = x
    if count > 0:
        # since we only "flushed" above once we'd iterated _past_ an element,
        # the last unique value hasn't been counted. count it now.
        total_count += min( static_counter[ last ], count )
    return total_count
The idea of this is that you do some of the work up front when you create the Counter object. Once you've done that work, you can use the Counter object to quickly look up counts, just like you look up values in a dictionary: static_counter[ x ] returns the number of times x occurred in the static list.
Since the static list is the same every time, you can do this once and use the resulting quick-lookup structure 15 000 times.
On the other hand, setting up a Counter object for the dynamic list may not pay off performance-wise. There is a little bit of overhead involved in creating a Counter object, and we'd only use each dynamic list Counter one time. If we can avoid constructing the object at all, it makes sense to do so. And as we saw above, you can in fact implement what you need by just iterating through the dynamic list and looking up counts in the other counter.
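For comparison, the same run-counting idea can be written more compactly with itertools.groupby. This is only a sketch: the name common_elements_grouped is made up, it assumes b1 from the question is the list that stays the same, and it has not been benchmarked against the hand-written loop.

```python
from collections import Counter
from itertools import groupby


def common_elements_grouped(static_counter, dynamic_sorted_list):
    total = 0
    # groupby with no key yields one group per run of equal values,
    # which covers every occurrence because the list is sorted
    for value, run in groupby(dynamic_sorted_list):
        run_length = sum(1 for _ in run)
        # a Counter returns 0 for values it has never seen
        total += min(static_counter[value], run_length)
    return total


# build the Counter for the unchanging list once, then reuse it for every call
static_counter = Counter([0, 2, 2, 3, 3, 4, 6, 8])
score = common_elements_grouped(static_counter, [2, 3, 3, 4, 4, 5])
print(score)
```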
The scoreWalk function in your post does not handle the case where the biggest item is only in the static list, e.g. scoreWalk( [1,1,3], [1,1,2] ). Correcting that, however, it actually performs better than any of the Counter approaches for me, contrary to the results you report. There may be a significant difference in the distribution of your data to my uniformly-distributed test data, but double-check your benchmarking of scoreWalk just to be sure.
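One way to correct it, shown here as a sketch rather than the exact version that was benchmarked, is to stop as soon as either index runs off the end of its list:

```python
def score_walk(list1, list2):
    """Count common elements (with multiplicity) of two sorted lists."""
    score = 0
    i = j = 0
    len1, len2 = len(list1), len(list2)
    # stop when either list is exhausted, so list2[j] is never out of range
    while i < len1 and j < len2:
        if list1[i] == list2[j]:
            score += 1
            i += 1
            j += 1
        elif list1[i] > list2[j]:
            j += 1
        else:
            i += 1
    return score


print(score_walk([1, 1, 3], [1, 1, 2]))
```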
Lastly, consider that you may be using the wrong tool for the job. You're not after short, elegant and readable -- you're trying to squeeze every last bit of performance out of a rather simple task. CPython allows you to write modules in C. One of the primary use cases for this is to implement highly optimized code. It may be a good fit for your task.
You can do this with a dict comprehension:
>>> a = [2, 3, 3, 4, 4, 5]
>>> b1 = [0, 2, 2, 3, 3, 4, 6, 8]
>>> {k: min(b1.count(k), a.count(k)) for k in set(a)}
{2: 1, 3: 2, 4: 1, 5: 0}
This is much faster if set(a) is small. If set(a) is more than 40 items, the Counter based solution is faster.
I'm a Python newbie. I have worked with lists for 2 months and I have some questions. I have some lists that contain duplicate items. I can already get the duplicate items between 2 lists; now I want to increase the number of lists and the depth, as in this example:
http://i219.photobucket.com/albums/cc213/DoSvn/example.png.
I want to get the parents of the duplicate items in the red part, not the blue part or the list of the duplicate items themselves. How can I do it?
Thank you :)
Update:
Thanks for your answers :D I have used sets and they work great. But if I don't know the size of the list of lists in advance, because the lists are dynamic, can I still get all of the red parts, as in this example: http://i219.photobucket.com/albums/cc213/DoSvn/example02.png ?
If you are looking for something like this: http://i219.photobucket.com/albums/cc213/DoSvn/example02.png
Then you can try the Counter (available in Python 2.7+). It should work like this:
from collections import Counter
c = Counter()
for s in (listOfLists):
    c.update(s)
for item, nbItems in c.iteritems():
    if nbItems == 3:
        print '%s belongs to three lists.' % item
Or with older Pythons:
counter = {}
for s in (listOfLists):
    for elem in s:
        counter[elem] = counter.get(elem, 0) + 1
for item, nbItems in counter.iteritems():
    if nbItems == 3:
        print '%s belongs to three lists.' % item
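For a list of lists whose size is not known in advance, the hard-coded 3 can be replaced with the number of lists. The sketch below uses a made-up function name, items_in_all_lists, and converts each sublist to a set first, on the assumption that an item repeated inside a single list should still count as belonging to only one list.

```python
from collections import Counter


def items_in_all_lists(list_of_lists):
    membership = Counter()
    for sub in list_of_lists:
        # set(sub) counts each item at most once per list
        membership.update(set(sub))
    n = len(list_of_lists)
    return set(item for item, hits in membership.items() if hits == n)


common = items_in_all_lists([[1, 2, 3, 4, 5], [4, 5, 6, 7, 8], [1, 3, 5, 7, 9]])
print(common)
```

The same membership counter can answer "which items belong to exactly k lists" by comparing hits against k instead of n.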
Use sets and you can get intersection, union, subtraction or any complex combination
s1 = set([1, 2, 3, 4, 5])
s2 = set([4, 5, 6, 7, 8])
s3 = set([1, 3, 5, 7, 9])
# now to get the duplicates between s1, s2 and s3, take the intersection
print s1&s2&s3
output:
set([5])