Count Distinct Values in a List of Lists - python

I just imported the values from a .csv file into a list of lists, and now I need to know how many distinct users there are. The resulting list looks like the following:
[['123', 'apple'], ['123', 'banana'], ['345', 'apple'], ['567', 'berry'], ['567', 'banana']]
Basically, I need to know how many distinct users (first value in each sub-list is a user ID) are there (3 in this case, over 6,000 after doing some Excel filtering), and what are the frequencies for the food itself: {'apple': 2, 'banana': 2, 'berry': 1}.
Here is the code I have tried to use for distinct values counts (using Python 2.7):
import csv
with open('food.csv', 'rb') as food:
    next(food)
    for line in food:
        csv_food = csv.reader(food)
        result_list = list(csv_follows)
        result_distinct = list(x for l in result_list for x in l)
        print len(result_distinct)

You can use [x[0] for x in result_list] to get a list of all the IDs. Then create a set from it, which keeps only the unique items in that list. The length of the set gives you the number of unique users.
len(set([x[0] for x in result_list]))
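As a quick check against the sample data from the question, here is a minimal sketch. The name `rows` is only illustrative and stands in for `result_list`:

```python
# Sample rows from the question: [user_id, food]
rows = [['123', 'apple'], ['123', 'banana'], ['345', 'apple'],
        ['567', 'berry'], ['567', 'banana']]

user_ids = [row[0] for row in rows]   # every id, duplicates included
unique_ids = set(user_ids)            # duplicates collapse inside a set

print(len(user_ids))    # 5
print(len(unique_ids))  # 3
```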

Well that is what a Counter is all about:
import csv
from collections import Counter

with open('food.csv', 'rb') as food:  # 'rb' is the Python 2 mode for csv files
    next(food)  # skip the first line (assumed to be a header)
    # Read every remaining row as [user_id, food]
    result_list = list(csv.reader(food))

# Count how often each food (second column) appears
result_counter = Counter(x[1] for x in result_list)
print len(result_counter)
A Counter is a special dictionary. Internally it will contain {'apple': 2, 'banana': 2, 'berry': 1}, so you can inspect all elements with their counts. len(result_counter) gives the number of distinct elements, whereas sum(result_counter.values()) gives the total number of elements.
EDIT: apparently you want to count the number of distinct users. You can do this with:
len({x[0] for x in result_list})
The {... for x in result_list} is a set comprehension.
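To make the difference between len() and sum() on a Counter concrete, here is a small sketch on the sample data from the question (the variable names are illustrative):

```python
from collections import Counter

rows = [['123', 'apple'], ['123', 'banana'], ['345', 'apple'],
        ['567', 'berry'], ['567', 'banana']]

food_counts = Counter(row[1] for row in rows)   # count the second column
distinct_users = {row[0] for row in rows}       # set comprehension on the first column

print(food_counts['apple'])        # 2
print(len(food_counts))            # 3 distinct foods
print(sum(food_counts.values()))   # 5 rows in total
print(len(distinct_users))         # 3 distinct users
```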

To get the distinct users, you can use a set:
result_distinct = len({x[0] for x in result_list})
For the frequencies, you can use collections.Counter:
import collections
freqs = collections.Counter(x[1] for x in result_list)
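For readers on Python 3, the whole pipeline might look like the sketch below. It reads from an in-memory string so that it runs on its own; the header names `user_id,food` are an assumption, since the question does not show the file's header.

```python
import csv
import io
from collections import Counter

# Stand-in for food.csv; the header line is an assumption.
csv_text = "user_id,food\n123,apple\n123,banana\n345,apple\n567,berry\n567,banana\n"

# With a real file, use: open('food.csv', newline='')
with io.StringIO(csv_text) as handle:
    reader = csv.reader(handle)
    next(reader)                  # skip the header row
    result_list = list(reader)    # remaining rows as [user_id, food]

n_users = len({row[0] for row in result_list})
freqs = Counter(row[1] for row in result_list)

print(n_users)
print(dict(freqs))
```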

For the first question, use set,
import operator
lists = [['123', 'apple'], ['123', 'banana'], ['345', 'apple'], ['567', 'berry'], ['567', 'banana']]
nrof_users = len(set(map(operator.itemgetter(0), lists)))
print(nrof_users)
# 3
For the second question, use collections.Counter,
import collections
import operator
result = collections.Counter(map(operator.itemgetter(1), lists))
print(result)
# Counter({'apple': 2, 'banana': 2, 'berry': 1})

A=[[0, 1],[0, 3],[1, 3],[3, 4],[3, 6],[4, 5],[4, 7],[5, 7],[6, 4]]
K = []
for pair in A:
    K.extend(pair)
print(set(K))
OUTPUT:
{0, 1, 3, 4, 5, 6, 7}
In Python, list.extend adds each element of its argument to the list instead of appending the argument as a single nested item, which is what we need here. Passing the flattened list to set then yields the distinct values.
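An alternative sketch of the same flattening step uses itertools.chain.from_iterable, which avoids building the intermediate list K. This is a swapped-in technique, not the answer's own method:

```python
from itertools import chain

A = [[0, 1], [0, 3], [1, 3], [3, 4], [3, 6], [4, 5], [4, 7], [5, 7], [6, 4]]

# chain.from_iterable walks every sub-list in turn; set() keeps distinct values
distinct = set(chain.from_iterable(A))

print(sorted(distinct))  # [0, 1, 3, 4, 5, 6, 7]
print(len(distinct))     # 7
```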

Related

Python: Sorting a Python list to show which string is most common to least common and the number of times it appears

I have a winners list which will receive different entries each time the rest of my code is run.
For example, the list could look like:
winners = ['Tortoise','Tortoise','Hare']
I am able to find the most common entry by using:
mostWins = [word for word, word_count in Counter(winners).most_common(Animalnum)]
which would output:
['Tortoise']
My problem is displaying the entire list from most common to least common, along with how many times each string is found in the list.
Just iterate over that .most_common:
>>> winners = ['Tortoise','Tortoise','Hare','Tortoise','Hare','Bob']
>>> import collections
>>> for name, wins in collections.Counter(winners).most_common():
...     print(name, wins)
...
Tortoise 3
Hare 2
Bob 1
>>>
Counter is a dict subclass, so you can also use it like a regular dictionary.
from collections import Counter
winners = ['Tortoise','Tortoise','Hare','Tortoise','Hare','Bob', 'Bob', 'John']
counts = Counter(winners)
print(counts)
# Counter({'Tortoise': 3, 'Hare': 2, 'Bob': 2, 'John': 1})
print(counts['Hare'])
# 2
Furthermore, the .most_common(n) method returns the n most frequent (element, count) pairs, ordered from most to least common. Called without an argument, it returns all of them.
So pass n only if you'd like to show the top n, e.g. the top 3:
counts.most_common(3)
# [('Tortoise', 3), ('Hare', 2), ('Bob', 2)]
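A small sketch of how .most_common() orders its result compared with .items(), assuming Python 3.7+ (where a Counter keeps keys in first-seen order). The sample list is made up for illustration:

```python
from collections import Counter

winners = ['Hare', 'Tortoise', 'Tortoise', 'Bob', 'Tortoise', 'Hare']
counts = Counter(winners)

# items() follows first-seen order, not frequency
print(list(counts.items()))    # [('Hare', 2), ('Tortoise', 3), ('Bob', 1)]

# most_common() sorts by count, highest first
print(counts.most_common())    # [('Tortoise', 3), ('Hare', 2), ('Bob', 1)]
print(counts.most_common(1))   # [('Tortoise', 3)]
```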

Counting by partial string in Python

So, I have a list of strings (upper case letters).
list = ['DOG01', 'CAT02', 'HORSE04', 'DOG02', 'HORSE01', 'CAT01', 'CAT03', 'HORSE03', 'HORSE02']
How can I group and count occurrence in the list?
Expected output:
You may try using collections.Counter here:
from collections import Counter
import re
names = ['DOG01', 'CAT02', 'HORSE04', 'DOG02', 'HORSE01', 'CAT01', 'CAT03', 'HORSE03', 'HORSE02']
# Strip the trailing digits so only the alphabetic prefix remains
prefixes = [re.sub(r'\d+$', '', x) for x in names]
print(Counter(prefixes))
This prints:
Counter({'HORSE': 4, 'CAT': 3, 'DOG': 2})
Note that the above approach simply strips off the number endings of each list element, then does an aggregation on the alpha names only.
You can also use a plain dictionary:
names = ['DOG01', 'CAT02', 'HORSE04', 'DOG02', 'HORSE01',
         'CAT01', 'CAT03', 'HORSE03', 'HORSE02']
dic = {}
for name in names:
    key = name[:-2]  # drop the last two characters (assumes a two-digit suffix)
    if key in dic:
        dic[key] = dic[key] + 1
    else:
        dic[key] = 1
print(dic)
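If the numeric suffix is not always two digits long, slicing off a fixed number of characters would miscount. One possible variation strips any run of trailing digits with str.rstrip and uses dict.get for the counting; the output order shown assumes Python 3.7+ dict ordering:

```python
names = ['DOG01', 'CAT02', 'HORSE04', 'DOG02', 'HORSE01',
         'CAT01', 'CAT03', 'HORSE03', 'HORSE02']

counts = {}
for name in names:
    prefix = name.rstrip('0123456789')          # remove trailing digits of any length
    counts[prefix] = counts.get(prefix, 0) + 1  # start at 0 for unseen prefixes

print(counts)  # {'DOG': 2, 'CAT': 3, 'HORSE': 4}
```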

Create list / text of repeated strings based on dictionary number

I have the following dictionary:
mydict = {'mindestens': 2,
'Situation': 3,
'österreichische': 2,
'habe.': 1,
'Über': 1,
}
How can I get a list or text out of it in which each string is repeated as many times as the number mapped to it in the dictionary?
mylist = ['mindestens', 'mindestens', 'Situation', 'Situation', 'Situation',.., 'Über']
mytext = 'mindestens mindestens Situation Situation Situation ... Über'
You might just use loops:
mylist = []
for word, times in mydict.items():
    for i in range(times):
        mylist.append(word)
The itertools library has convenient features for such cases:
from itertools import chain, repeat
mydict = {'mindestens': 2, 'Situation': 3, 'österreichische': 2,
'habe.': 1, 'Über': 1,
}
res = list(chain.from_iterable(repeat(k, v) for k, v in mydict.items()))
print(res)
The output:
['mindestens', 'mindestens', 'Situation', 'Situation', 'Situation', 'österreichische', 'österreichische', 'habe.', 'Über']
For the text version, joining the list items is trivial: ' '.join(<iterable>)
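Another option, sketched here as an alternative rather than as part of the answers above, is Counter.elements(), which repeats each key as many times as its count. The ordering of the result assumes Python 3.7+:

```python
from collections import Counter

mydict = {'mindestens': 2, 'Situation': 3, 'österreichische': 2,
          'habe.': 1, 'Über': 1}

# elements() yields each key repeated `count` times
mylist = list(Counter(mydict).elements())
mytext = ' '.join(mylist)   # the text version is a single join away

print(mylist)
print(mytext)
```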

How to get count of unique values in a list

Given a list:
a = ['ed', 'ed', 'ed', 'ash', 'ash', 'daph']
I want to iterate through the list and get the top 2 most used names. So I should expect a result of ['ed', 'ash']
[Update]
How would I go about this without using a library?
collections.Counter has a most_common method:
from collections import Counter
a = ['ed', 'ed', 'ed', 'ash', 'ash', 'daph']
res = [item[0] for item in Counter(a).most_common(2)]
print(res) # ['ed', 'ash']
With most_common(2) I get the 2 most common elements (and their multiplicity); the list comprehension then drops the multiplicity and keeps just the item from your original list.
try:
>>> from collections import Counter
>>> c = Counter(a)
>>> c
Counter({'ed': 3, 'ash': 2, 'daph': 1})
# Sort items based on occurrence using most_common()
>>> c.most_common()
[('ed', 3), ('ash', 2), ('daph', 1)]
# Get top 2 using most_common(2)
>>> [item[0] for item in c.most_common(2)]
['ed', 'ash']
# Get top 2 using sorted
>>> sorted(c, key=c.get, reverse=True)[:2]
['ed', 'ash']
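For the update in the question (no library), a minimal sketch that counts with a plain dict and then sorts the keys by their count. Names tied on count keep their first-seen order, because sorted is stable:

```python
a = ['ed', 'ed', 'ed', 'ash', 'ash', 'daph']

counts = {}
for name in a:
    counts[name] = counts.get(name, 0) + 1   # tally each name by hand

# Sort the names by their tally, highest first, and keep the first two
top_two = sorted(counts, key=counts.get, reverse=True)[:2]

print(top_two)  # ['ed', 'ash']
```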

Finding index of values in a list dynamically

I have two lists as follows:
list_1
['A-1','A-1','A-1','A-2','A-2','A-3']
list_2
['iPad','iPod','iPhone','Windows','X-box','Kindle']
I would like to split the list_2 based on the index values in list_1. For instance,
list_a1
['iPad','iPod','iPhone']
list_a2
['Windows','X-box']
list_a3
['Kindle']
I know the index method, but it needs the value to be matched passed in. In this case, I would like to dynamically find the indexes of the entries in list_1 that share the same value. Is this possible? Any tips/hints would be deeply appreciated.
Thanks.
There are a few ways to do this.
I'd do it by using zip and groupby.
First:
>>> list(zip(list_1, list_2))
[('A-1', 'iPad'),
('A-1', 'iPod'),
('A-1', 'iPhone'),
('A-2', 'Windows'),
('A-2', 'X-box'),
('A-3', 'Kindle')]
Now:
>>> import itertools, operator
>>> [(key, list(group)) for key, group in
... itertools.groupby(zip(list_1, list_2), operator.itemgetter(0))]
[('A-1', [('A-1', 'iPad'), ('A-1', 'iPod'), ('A-1', 'iPhone')]),
('A-2', [('A-2', 'Windows'), ('A-2', 'X-box')]),
('A-3', [('A-3', 'Kindle')])]
So, you just want each group, ignoring the key, and you only want the second element of each element in the group. You can get the second element of each group with another comprehension, or just by unzipping:
>>> [list(zip(*group))[1] for key, group in
... itertools.groupby(zip(list_1, list_2), operator.itemgetter(0))]
[('iPad', 'iPod', 'iPhone'), ('Windows', 'X-box'), ('Kindle',)]
I would personally find this more readable as a sequence of separate iterator transformations than as one long expression. Taken to the extreme:
>>> ziplists = zip(list_1, list_2)
>>> pairs = itertools.groupby(ziplists, operator.itemgetter(0))
>>> groups = (group for key, group in pairs)
>>> values = (list(zip(*group))[1] for group in groups)
>>> [list(value) for value in values]
… but a happy medium of maybe 2 or 3 lines is usually better than either extreme.
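One caveat worth illustrating: itertools.groupby only groups consecutive equal keys, so the approach above relies on list_1 already having equal values next to each other. A short sketch with made-up keys:

```python
from itertools import groupby

keys = ['A-1', 'A-2', 'A-1']   # the two 'A-1' entries are not adjacent

# Each run of equal neighbours becomes its own group
runs = [(key, len(list(group))) for key, group in groupby(keys)]
print(runs)         # [('A-1', 1), ('A-2', 1), ('A-1', 1)]

# Sorting first brings equal keys together
sorted_runs = [(key, len(list(group))) for key, group in groupby(sorted(keys))]
print(sorted_runs)  # [('A-1', 2), ('A-2', 1)]
```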
Usually I'm the one rushing to a groupby solution ;^) but here I'll go the other way and manually insert into an OrderedDict:
list_1 = ['A-1','A-1','A-1','A-2','A-2','A-3']
list_2 = ['iPad','iPod','iPhone','Windows','X-box','Kindle']
from collections import OrderedDict
d = OrderedDict()
for code, product in zip(list_1, list_2):
    d.setdefault(code, []).append(product)
produces a d looking like
>>> d
OrderedDict([('A-1', ['iPad', 'iPod', 'iPhone']),
('A-2', ['Windows', 'X-box']), ('A-3', ['Kindle'])])
with easy access:
>>> d["A-2"]
['Windows', 'X-box']
and we can get the list-of-lists in list_1 order using .values():
>>> list(d.values())
[['iPad', 'iPod', 'iPhone'], ['Windows', 'X-box'], ['Kindle']]
If you've noticed that no one is telling you how to make a bunch of independent lists with names like list_a1 and so on-- that's because that's a bad idea. You want to keep the data together in something which you can (at a minimum) iterate over easily, and both dictionaries and list of lists qualify.
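To show what keeping the data together buys you, here is a sketch that groups into a plain dict and then iterates over it. It assumes Python 3.7+, where a regular dict preserves insertion order; the name `groups` is illustrative:

```python
list_1 = ['A-1', 'A-1', 'A-1', 'A-2', 'A-2', 'A-3']
list_2 = ['iPad', 'iPod', 'iPhone', 'Windows', 'X-box', 'Kindle']

groups = {}
for code, product in zip(list_1, list_2):
    groups.setdefault(code, []).append(product)   # create the list on first use

# One loop handles every group, however many codes there are
for code, products in groups.items():
    print(code, len(products), products)
```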
Maybe something like this?
#!/usr/local/cpython-3.3/bin/python
import pprint
import collections

def main():
    list_1 = ['A-1','A-1','A-1','A-2','A-2','A-3']
    list_2 = ['iPad','iPod','iPhone','Windows','X-box','Kindle']
    result = collections.defaultdict(list)
    for list_1_element, list_2_element in zip(list_1, list_2):
        result[list_1_element].append(list_2_element)
    pprint.pprint(result)

main()
Using itertools.izip_longest and itertools.groupby (Python 2; in Python 3, izip_longest is named zip_longest):
First group the items of list_1 and find the starting index of each group:
>>> from itertools import groupby, izip_longest
>>> inds = [next(g)[0] for k, g in groupby(enumerate(list_1), key=lambda x:x[1])]
>>> inds
[0, 3, 5]
Now use slicing and izip_longest as we need pairs list_2[0:3], list_2[3:5], list_2[5:]:
>>> [list_2[x:y] for x, y in izip_longest(inds, inds[1:])]
[['iPad', 'iPod', 'iPhone'], ['Windows', 'X-box'], ['Kindle']]
To get a dict of lists, you can do something like:
>>> inds = [next(g) for k, g in groupby(enumerate(list_1), key=lambda x:x[1])]
>>> {k: list_2[ind1: ind2[0]] for (ind1, k), ind2 in
...  izip_longest(inds, inds[1:], fillvalue=[None])}
{'A-1': ['iPad', 'iPod', 'iPhone'], 'A-3': ['Kindle'], 'A-2': ['Windows', 'X-box']}
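A Python 3 sketch of the same slicing idea, using itertools.zip_longest. The names `starts` and `chunks` are illustrative:

```python
from itertools import groupby, zip_longest

list_1 = ['A-1', 'A-1', 'A-1', 'A-2', 'A-2', 'A-3']
list_2 = ['iPad', 'iPod', 'iPhone', 'Windows', 'X-box', 'Kindle']

# Index at which each run of equal values in list_1 begins
starts = [next(group)[0]
          for _, group in groupby(enumerate(list_1), key=lambda pair: pair[1])]

# Pair each start with the next one; the last pair ends with None (slice to the end)
chunks = [list_2[start:stop] for start, stop in zip_longest(starts, starts[1:])]

print(starts)  # [0, 3, 5]
print(chunks)  # [['iPad', 'iPod', 'iPhone'], ['Windows', 'X-box'], ['Kindle']]
```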
You could do this if you want simple code. It's not pretty, but it gets the job done.
list_1 = ['A-1','A-1','A-1','A-2','A-2','A-3']
list_2 = ['iPad','iPod','iPhone','Windows','X-box','Kindle']
list_1a = []
list_1b = []
list_1c = []
# Walk both lists in step and sort each product by its code
for place, code in enumerate(list_1):
    if code == 'A-1':
        list_1a.append(list_2[place])
    elif code == 'A-2':
        list_1b.append(list_2[place])
    else:
        list_1c.append(list_2[place])
