Shuffle two lists at once with the same order - Python

I'm using the nltk library's movie_reviews corpus, which contains a large number of documents. My task is to get the predictive performance on these reviews with and without pre-processing of the data. The problem is that the lists documents and documents2 contain the same documents, and I need to shuffle them while keeping the same order in both lists. I cannot shuffle them separately, because each time I shuffle a list I get a different order. I need to shuffle both at once, with the same order, because I need to compare them in the end (the comparison depends on order). I'm using Python 2.7.
Example (in reality the strings are tokenized, but that is not relevant here):
documents = [(['plot : two teen couples go to a church party , '], 'neg'),
             (['drink and then drive . '], 'pos'),
             (['they get into an accident . '], 'neg'),
             (['one of the guys dies'], 'neg')]
documents2 = [(['plot two teen couples church party'], 'neg'),
              (['drink then drive . '], 'pos'),
              (['they get accident . '], 'neg'),
              (['one guys dies'], 'neg')]
And I need to get this result after shuffling both lists:
documents = [(['one of the guys dies'], 'neg'),
             (['they get into an accident . '], 'neg'),
             (['drink and then drive . '], 'pos'),
             (['plot : two teen couples go to a church party , '], 'neg')]
documents2 = [(['one guys dies'], 'neg'),
              (['they get accident . '], 'neg'),
              (['drink then drive . '], 'pos'),
              (['plot two teen couples church party'], 'neg')]
I have this code:
def cleanDoc(doc):
    stopset = set(stopwords.words('english'))
    stemmer = nltk.PorterStemmer()
    clean = [token.lower() for token in doc if token.lower() not in stopset and len(token) > 2]
    final = [stemmer.stem(word) for word in clean]
    return final

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

documents2 = [(list(cleanDoc(movie_reviews.words(fileid))), category)
              for category in movie_reviews.categories()
              for fileid in movie_reviews.fileids(category)]
random.shuffle(...)  # here I need to shuffle documents and documents2 in the same order - how?

You can do it as:
import random
a = ['a', 'b', 'c']
b = [1, 2, 3]
c = list(zip(a, b))
random.shuffle(c)
a, b = zip(*c)
print a
print b
Output:
['a', 'c', 'b']
[1, 3, 2]
Of course, this was an example with simpler lists, but the adaptation is the same for your case.
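A minimal sketch of the same idea applied to the question's two lists (assuming documents and documents2 have been built as in the question):

import random

# pair up corresponding entries, shuffle the pairs, then unpack
combined = list(zip(documents, documents2))
random.shuffle(combined)
documents, documents2 = (list(t) for t in zip(*combined))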

Here's an easy way to do this:
import numpy as np
a = np.array([0,1,2,3,4])
b = np.array([5,6,7,8,9])
indices = np.arange(a.shape[0])
np.random.shuffle(indices)
a = a[indices]
b = b[indices]
# a, array([3, 4, 1, 2, 0])
# b, array([8, 9, 6, 7, 5])

import numpy as np
from sklearn.utils import shuffle
a = ['a', 'b', 'c','d','e']
b = [1, 2, 3, 4, 5]
a_shuffled, b_shuffled = shuffle(np.array(a), np.array(b))
print(a_shuffled, b_shuffled)
#random output
#['e' 'c' 'b' 'd' 'a'] [5 3 2 4 1]
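If you need the result to be reproducible across runs, sklearn.utils.shuffle also accepts a random_state argument (the seed value 42 here is arbitrary):

a_shuffled, b_shuffled = shuffle(np.array(a), np.array(b), random_state=42)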

Shuffle an arbitrary number of lists simultaneously.
from random import shuffle
def shuffle_list(*ls):
    l = list(zip(*ls))
    shuffle(l)
    return zip(*l)
a = [0,1,2,3,4]
b = [5,6,7,8,9]
a1,b1 = shuffle_list(a,b)
print(a1,b1)
a = [0,1,2,3,4]
b = [5,6,7,8,9]
c = [10,11,12,13,14]
a1,b1,c1 = shuffle_list(a,b,c)
print(a1,b1,c1)
Output:
$ (0, 2, 4, 3, 1) (5, 7, 9, 8, 6)
$ (4, 3, 0, 2, 1) (9, 8, 5, 7, 6) (14, 13, 10, 12, 11)
Note:
objects returned by shuffle_list() are tuples.
P.S.
shuffle_list() can also be applied to numpy arrays:
a = np.array([1,2,3])
b = np.array([4,5,6])
a1,b1 = shuffle_list(a,b)
print(a1,b1)
Output:
$ (3, 1, 2) (6, 4, 5)

An easy and fast way to do this is to use random.seed() together with random.shuffle(). It lets you reproduce the same random order as many times as you want.
It will look like this:
import random

a = [1, 2, 3, 4, 5]
b = [6, 7, 8, 9, 10]
seed = random.random()
random.seed(seed)
random.shuffle(a)
random.seed(seed)   # reseed so b is shuffled into the same order
random.shuffle(b)
print(a)
print(b)
>>[3, 1, 4, 2, 5]
>>[8, 6, 9, 7, 10]
This also works when you can't hold both lists in memory at the same time. Note that both lists must have the same length: the shuffle consumes a number of random values that depends on the list length, so reseeding reproduces the same permutation only for equal-length lists.

You can store the order of the values in a variable, then rebuild both arrays in that shuffled order:
import random

array1 = [1, 2, 3, 4, 5]
array2 = ["one", "two", "three", "four", "five"]
order = range(len(array1))   # Python 2: range() returns a list
random.shuffle(order)
newarray1 = []
newarray2 = []
for x in range(len(order)):
    newarray1.append(array1[order[x]])
    newarray2.append(array2[order[x]])
print newarray1, newarray2
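The snippet above is Python 2 (range() returns a list there, and print is a statement). A minimal Python 3 sketch of the same idea might look like this:

import random

array1 = [1, 2, 3, 4, 5]
array2 = ["one", "two", "three", "four", "five"]
order = list(range(len(array1)))  # range objects can't be shuffled in place
random.shuffle(order)
newarray1 = [array1[i] for i in order]
newarray2 = [array2[i] for i in order]
print(newarray1, newarray2)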

This works as well:
import numpy as np
a = ['a', 'b', 'c']
b = [1, 2, 3]
rng = np.random.default_rng()
state = rng.bit_generator.state
rng.shuffle(a)
# use the same generator state for a & b!
rng.bit_generator.state = state # set state to same state as before
rng.shuffle(b)
print(a)
print(b)
Output:
['b', 'a', 'c']
[2, 1, 3]
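A related idiom with the same Generator API, which avoids saving and restoring state, is to draw one shared index permutation and apply it to both sequences (a sketch, assuming the two sequences have equal length):

import numpy as np

a = np.array(['a', 'b', 'c'])
b = np.array([1, 2, 3])
rng = np.random.default_rng()
idx = rng.permutation(len(a))  # one random permutation of the indices
print(a[idx], b[idx])          # both reordered identically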

You can use the second argument of the shuffle function to fix the order of shuffling.
Specifically, you can pass the second argument of the shuffle function a zero-argument function which returns a value in [0, 1). The return value of this function fixes the order of shuffling. (By default, i.e. if you do not pass any function as the second argument, it uses the function random.random(). You can see it at line 277 here.) Note that this second argument was deprecated in Python 3.9 and removed in Python 3.11, so this approach only works on older versions.
This example illustrates what I described:
import random
a = ['a', 'b', 'c', 'd', 'e']
b = [1, 2, 3, 4, 5]
r = random.random() # randomly generating a real in [0,1)
random.shuffle(a, lambda : r) # lambda : r is a zero-argument function which returns r
random.shuffle(b, lambda : r) # using the same function as used in prev line so that shuffling order is same
print a
print b
Output:
['e', 'c', 'd', 'a', 'b']
[5, 3, 4, 1, 2]

Related

Adding values in a list if they are the same, with a twist

I have a question (Python) regarding adding values in a list if they share the same key in a first list. So for example we have:
lst1 = ['A', 'A', 'B', 'A', 'C', 'D']
lst2 = [1, 2, 3, 4, 5, 6]
What I would like to know is how I can add the numbers in lst2 if the strings in lst1 are the same. The end result would thus be:
new_lst1 = ['A', 'B', 'A', 'C', 'D']
new_lst2 = [3, 3, 4, 5, 6]
where
new_lst2[0] = 1+2
So values only get added when they are next to each other.
To make it more complicated, it should also be possible if we have this example:
lst3 = ['A', 'A', 'A', 'B', 'B', 'A', 'A']
lst4 = [1, 2, 3, 4, 5, 6, 7]
for which the result has to be:
new_lst3 = ['A', 'B', 'A']
new_lst4 = [6, 9, 13]
where
new_lst4[0] = 1 + 2 + 3, new_lst4[1] = 4 + 5, and new_lst4[2] = 6 + 7.
Thank you in advance!
For a bit of background:
I wrote code that searches Dutch online underground models and returns the data of the underground at a specific input location.
The data is made up of layers:
Layer1, Name: "BXz1", top_layer, bottom_layer, transitivity
Layer2, Name: "BXz2", top_layer, bottom_layer, transitivity
Layer3, Name: "KRz1", top_layer, bottom_layer, transitivity
etc..
BXz1 and BXz2 are the same main layer however different sublayers. In terms of transitivity I would like to combine them if they are next to each other.
so in that way i would get:
Layer1+2, Name: BX, top_layer1, bottom_layer2, combined transitivity
Layer3, Name: "KRz1", top_layer, bottom_layer, transitivity
If you're not allowed to use libraries, you could do it with a simple loop, using zip() to pair up the keys and values:
lst3 = ["A", "A", "A", "B", "B", "A", "A"]
lst4 = [1, 2, 3, 4, 5, 6, 7]
new_lst3, new_lst4 = lst3[:1], [0]  # initialize with first key
for k, n in zip(lst3, lst4):        # pair up keys and numbers
    if new_lst3[-1] != k:           # add new items if key changed
        new_lst3.append(k)
        new_lst4.append(0)
    new_lst4[-1] += n               # tally for current key
print(new_lst3) # ['A', 'B', 'A']
print(new_lst4) # [6, 9, 13]
If you're okay with libraries, groupby from itertools combined with an iterator on the keys will allow you to express it more concisely:
from itertools import groupby
tally = ((k, sum(n)) for i3 in [iter(lst3)]
         for k, n in groupby(lst4, lambda _: next(i3)))
new_lst3, new_lst4 = map(list, zip(*tally))
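Run against the same lst3 and lst4 as above, this produces the expected tallies:

print(new_lst3)  # ['A', 'B', 'A']
print(new_lst4)  # [6, 9, 13]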
The itertools.groupby function in the standard library provides the base functionality you need. It's then a matter of giving it the right key and tallying the count in each group.
Here's my implementation:
from itertools import groupby
def tally_by_group(keys, counts):
    groups = groupby(zip(keys, counts), key=lambda x: x[0])
    tallies = [
        (key, sum(count for _, count in group))
        for key, group in groups
    ]
    return tuple(list(l) for l in zip(*tallies))
Code explanation:
my first zip() creates (key, count) tuples from the two lists,
the groupby groups them by the first element of each tuple, i.e., by the key,
then I construct a list of (key, sum(count)) into tallies,
and finally unpack that back into two lists for the results.
Tests with your examples:
lst1 = ["A", "A", "B", "A", "C", "D"]
lst2 = [1, 2, 3, 4, 5, 6]
l1m, l2m = tally_by_group(lst1, lst2)
print(l1m)
print(l2m)
outputs:
['A', 'B', 'A', 'C', 'D']
[3, 3, 4, 5, 6]
And
lst3 = ["A", "A", "A", "B", "B", "A", "A"]
lst4 = [1, 2, 3, 4, 5, 6, 7]
l3m, l4m = tally_by_group(lst3, lst4)
print(l3m)
print(l4m)
outputs:
['A', 'B', 'A']
[6, 9, 13]

Finding original variable names of the important attributes in a FAMD PCA using Prince

I'm using the package Prince to perform a FAMD on data that consists of mixed data (so both categorical and non-categorical).
My code is the following:
famd = prince.FAMD(n_components=10, n_iter=3, copy=True, check_input=True, engine='auto', random_state=42)
famd = famd.fit(df_pca)
Which gives as output
Explained inertia
[0.08057161 0.05946225 0.03875787 0.03203083 0.02978785 0.02868602
0.02499968 0.02416245 0.02207422 0.02055546]
I have already tried df = pd.DataFrame(pca.components_, columns=list(dfPca.columns)), as mentioned in "PCA on sklearn - how to interpret pca.components_". Next to that, I have attempted to implement the solution offered by user seralouk, with some minor changes to make it fit the Prince FAMD.
n_pcs = len(inertia)
most_important = [inertia[i].argmax() for i in range(n_pcs)]
initial_feature_names = df_pca.columns
most_important_names = [initial_feature_names[most_important[i]] for i in range(n_pcs)]
dic = {'PC{}'.format(i): most_important_names[i] for i in range(n_pcs)}
pca_results = pd.DataFrame(dic.items())
However this does not appear to work for the Prince FAMD. Are there any ways to link the output of the FAMD to the original variable names?
The link you cited is for a PCA in sklearn. You are now using a FAMD from another package, which is quite different altogether.
In the link cited, the solution by @seralouk basically goes through the eigenvector for each PC and takes out the column with the highest absolute weight. Note this is NOT linking each component to an original column; it is finding the original column which contributes most to the PC.
You can do something like the below, but I would suggest reading up on FAMD / PCA in this book to be sure of what you are actually extracting.
Below is a rough implementation to get the columns that contribute most to each component, using the V matrix and the example from the prince help page:
import pandas as pd
X = pd.DataFrame(
    data=[
        ['A', 'A', 'A', 2, 5, 7, 6, 3, 6, 7],
        ['A', 'A', 'A', 4, 4, 4, 2, 4, 4, 3],
        ['B', 'A', 'B', 5, 2, 1, 1, 7, 1, 1],
        ['B', 'A', 'B', 7, 2, 1, 2, 2, 2, 2],
        ['B', 'B', 'B', 3, 5, 6, 5, 2, 6, 6],
        ['B', 'B', 'A', 3, 5, 4, 5, 1, 7, 5]
    ],
    columns=['E1 fruity', 'E1 woody', 'E1 coffee',
             'E2 red fruit', 'E2 roasted', 'E2 vanillin', 'E2 woody',
             'E3 fruity', 'E3 butter', 'E3 woody'],
    index=['Wine {}'.format(i+1) for i in range(6)]
)
import prince
import numpy as np
n_pcs = 3
famd = prince.FAMD(n_components=n_pcs)
famd = famd.fit(X)
most_important = np.abs(famd.V_).argmax(axis=1)
initial_feature_names = X.columns
most_important_names = initial_feature_names[most_important]
dic = {'PC{}'.format(i+1): most_important_names[i] for i in range(n_pcs)}
pca_results = pd.DataFrame(dic.items())
0 1
0 PC1 E1 coffee
1 PC2 E1 woody
2 PC3 E2 red fruit

Python adding a list to a slice of another list

Here's the basic problem:
>>> listb = [ 1, 2, 3, 4, 5, 6, 7 ]
>>> slicea = slice(2,5)
>>> listb[slicea]
[3, 4, 5]
>>> lista = listb[slicea]
>>> lista
[3, 4, 5]
>>> listb[slicea] += lista
>>> listb
[1, 2, 3, 4, 5, 3, 4, 5, 6, 7]
listb should be
[1, 2, 6, 8, 10, 6, 7]
but 3, 4, 5 was inserted after 3, 4, 5 instead of being added element-wise to it.
tl;dr
I have this code that's not working:
self.lib_tree.item(song)['values'][select_values] = adj_list
self.lib_tree.item(album)['values'][select_values] += adj_list
self.lib_tree.item(artist)['values'][select_values] += adj_list
The full code is this:
def toggle_select(self, song, album, artist):
    # 'values' 0=Access, 1=Size, 2=Selected Size, 3=StatTime, 4=StatSize,
    #          5=Count, 6=Seconds, 7=SelSize, 8=SelCount, 9=SelSeconds
    # Set slice to StatSize, Count, Seconds
    total_values = slice(4, 7)    # start at index, stop before index
    select_values = slice(7, 10)  # start at index, stop before index

    tags = self.lib_tree.item(song)['tags']
    if "songsel" in tags:
        # We will toggle off and subtract from selected parent totals
        tags.remove("songsel")
        self.lib_tree.item(song, tags=(tags))
        # Get StatSize, Count and Seconds
        adj_list = [element * -1 for element in
                    self.lib_tree.item(song)['values'][total_values]]
    else:
        tags.append("songsel")
        self.lib_tree.item(song, tags=(tags))
        # Get StatSize, Count and Seconds
        adj_list = self.lib_tree.item(song)['values'][total_values]  # 1 past

    self.lib_tree.item(song)['values'][select_values] = adj_list
    self.lib_tree.item(album)['values'][select_values] += adj_list
    self.lib_tree.item(artist)['values'][select_values] += adj_list

    if self.debug_toggle < 10:
        self.debug_toggle += 1
        print('artist,album,song:', self.lib_tree.item(artist, 'text'),
              self.lib_tree.item(album, 'text'),
              self.lib_tree.item(song, 'text'))
        print('adj_list:', adj_list)
The adj_list has the correct values showing up in debug.
How do I add a list of values to the slice of a list?
The behavior you want is not a feature of any built-in Python type; + on built-in sequences means concatenation, not element-wise addition. But numpy arrays will do what you want, so I'd suggest looking into numpy. A simple example:
>>> import numpy as np
>>> a = np.array([2,3,4], dtype=np.int64)
>>> b = np.array([5,6,7], dtype=np.int64)
>>> a += b
>>> a
array([ 7, 9, 11])
>>> print(a)
[ 7 9 11]
>>> print(a.tolist())
[7, 9, 11]
Note that the output (both repr and str forms) looks a little different from Python lists, but you can convert back to a plain Python list if needed.
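Applied to the original slice example, a minimal sketch would be to convert the slice to an array, add element-wise, and write the result back:

import numpy as np

listb = [1, 2, 3, 4, 5, 6, 7]
slicea = slice(2, 5)
arr = np.array(listb[slicea])         # [3, 4, 5]
listb[slicea] = (arr + arr).tolist()  # element-wise add, written back in place
print(listb)                          # [1, 2, 6, 8, 10, 6, 7]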

How to efficiently count each element in a list in Python? [duplicate]

This question already has answers here:
Using a dictionary to count the items in a list
(8 answers)
Closed 7 months ago.
Given an unordered list of values like
a = [5, 1, 2, 2, 4, 3, 1, 2, 3, 1, 1, 5, 2]
How can I get the frequency of each value that appears in the list, like so?
# `a` has 4 instances of `1`, 4 of `2`, 2 of `3`, 1 of `4,` 2 of `5`
b = [4, 4, 2, 1, 2] # expected output
In Python 2.7 (or newer), you can use collections.Counter:
>>> import collections
>>> a = [5, 1, 2, 2, 4, 3, 1, 2, 3, 1, 1, 5, 2]
>>> counter = collections.Counter(a)
>>> counter
Counter({1: 4, 2: 4, 5: 2, 3: 2, 4: 1})
>>> counter.values()
dict_values([2, 4, 4, 1, 2])
>>> counter.keys()
dict_keys([5, 1, 2, 4, 3])
>>> counter.most_common(3)
[(1, 4), (2, 4), (5, 2)]
>>> dict(counter)
{5: 2, 1: 4, 2: 4, 4: 1, 3: 2}
>>> # Get the counts in order matching the original specification,
>>> # by iterating over keys in sorted order
>>> [counter[x] for x in sorted(counter.keys())]
[4, 4, 2, 1, 2]
If you are using Python 2.6 or older, you can download an implementation here.
If the list is sorted, you can use groupby from the itertools standard library (if it isn't, you can just sort it first, although this takes O(n lg n) time):
from itertools import groupby
a = [5, 1, 2, 2, 4, 3, 1, 2, 3, 1, 1, 5, 2]
[len(list(group)) for key, group in groupby(sorted(a))]
Output:
[4, 4, 2, 1, 2]
Python 2.7+ introduced dictionary comprehensions. Building the dictionary from the list will get you the count as well as get rid of duplicates.
>>> a = [1,1,1,1,2,2,2,2,3,3,4,5,5]
>>> d = {x:a.count(x) for x in a}
>>> d
{1: 4, 2: 4, 3: 2, 4: 1, 5: 2}
>>> a, b = d.keys(), d.values()
>>> a
[1, 2, 3, 4, 5]
>>> b
[4, 4, 2, 1, 2]
Count the number of appearances manually by iterating through the list and counting them up, using a collections.defaultdict to track what has been seen so far:
from collections import defaultdict
appearances = defaultdict(int)
for curr in a:
    appearances[curr] += 1
In Python 2.7+, you could use collections.Counter to count items
>>> a = [1,1,1,1,2,2,2,2,3,3,4,5,5]
>>>
>>> from collections import Counter
>>> c=Counter(a)
>>>
>>> c.values()
[4, 4, 2, 1, 2]
>>>
>>> c.keys()
[1, 2, 3, 4, 5]
Counting the frequency of elements is probably best done with a dictionary:
b = {}
for item in a:
    b[item] = b.get(item, 0) + 1
To remove the duplicates, use a set:
a = list(set(a))
You can do this:
import numpy as np
a = [1,1,1,1,2,2,2,2,3,3,4,5,5]
np.unique(a, return_counts=True)
Output:
(array([1, 2, 3, 4, 5]), array([4, 4, 2, 1, 2], dtype=int64))
The first array is values, and the second array is the number of elements with these values.
So if you want to get just the array with the counts, you should use this:
np.unique(a, return_counts=True)[1]
Here's another succinct alternative using itertools.groupby, which also works for unordered input:
from itertools import groupby
items = [5, 1, 1, 2, 2, 1, 1, 2, 2, 3, 4, 3, 5]
results = {value: len(list(freq)) for value, freq in groupby(sorted(items))}
results
format: {value: num_of_occurrences}
{1: 4, 2: 4, 3: 2, 4: 1, 5: 2}
I would simply use scipy.stats.itemfreq in the following manner (note that itemfreq has since been deprecated and removed from scipy; np.unique with return_counts=True is the modern replacement):
from scipy.stats import itemfreq
a = [1,1,1,1,2,2,2,2,3,3,4,5,5]
freq = itemfreq(a)
a = freq[:,0]
b = freq[:,1]
you may check the documentation here: http://docs.scipy.org/doc/scipy-0.16.0/reference/generated/scipy.stats.itemfreq.html
import numpy as np
import pandas as pd
from collections import Counter
a=["E","D","C","G","B","A","B","F","D","D","C","A","G","A","C","B","F","C","B"]
counter=Counter(a)
kk=[list(counter.keys()),list(counter.values())]
pd.DataFrame(np.array(kk).T, columns=['Letter','Count'])
seta = set(a)
b = [a.count(el) for el in seta]
a = list(seta) #Only if you really want it.
Suppose we have a list:
fruits = ['banana', 'banana', 'apple', 'banana']
We can find out how many of each fruit we have in the list like so:
import numpy as np
(unique, counts) = np.unique(fruits, return_counts=True)
{x:y for x,y in zip(unique, counts)}
Result:
{'banana': 3, 'apple': 1}
This answer is more explicit
a = [1,1,1,1,2,2,2,2,3,3,3,4,4]
d = {}
for item in a:
    if item in d:
        d[item] = d.get(item) + 1
    else:
        d[item] = 1
for k, v in d.items():
    print(str(k) + ':' + str(v))
# output
#1:4
#2:4
#3:3
#4:2
#remove dups
d = set(a)
print(d)
#{1, 2, 3, 4}
For your first question, iterate the list and use a dictionary to keep track of each element's existence.
For your second question, just use the set operator.
def frequencyDistribution(data):
    return {i: data.count(i) for i in data}

print frequencyDistribution([1,2,3,4])
...
{1: 1, 2: 1, 3: 1, 4: 1} # originalNumber: count
I am quite late, but this will also work, and will help others:
a = [1,1,1,1,2,2,2,2,3,3,4,5,5]
freq_list = []
a_l = list(set(a))
for x in a_l:
    freq_list.append(a.count(x))
print 'Freq',freq_list
print 'number',a_l
will produce this..
Freq [4, 4, 2, 1, 2]
number [1, 2, 3, 4, 5]
a = [1,1,1,1,2,2,2,2,3,3,4,5,5]
counts = dict.fromkeys(a, 0)
for el in a: counts[el] += 1
print(counts)
# {1: 4, 2: 4, 3: 2, 4: 1, 5: 2}
a = [1,1,1,1,2,2,2,2,3,3,4,5,5]
# 1. Get counts and store in another list
output = []
for i in set(a):
    output.append(a.count(i))
print(output)
# 2. Remove duplicates using set constructor
a = list(set(a))
print(a)
A set does not allow duplicates, so passing a list to the set() constructor gives an iterable of unique objects. The count() function returns an integer count of how often the passed object occurs in the list. With that, the unique objects are counted and each count value is stored by appending to the empty list output.
The list() constructor is used to convert set(a) into a list, referred to by the same variable a.
Output
[4, 4, 2, 1, 2]
[1, 2, 3, 4, 5]
Simple solution using a dictionary.
def frequency(l):
    d = {}
    for i in l:
        if i in d.keys():
            d[i] += 1
        else:
            d[i] = 1
    for k, v in d.iteritems():
        if v == max(d.values()):
            return k, d.keys()
print(frequency([10,10,10,10,20,20,20,20,40,40,50,50,30]))
#!/usr/bin/python
def frq(words):
    freq = {}
    for w in words:
        if w in freq:
            freq[w] = freq.get(w) + 1
        else:
            freq[w] = 1
    return freq
fp = open("poem","r")
list = fp.read()
fp.close()
input = list.split()
print input
d = frq(input)
print "frequency of input\n: "
print d
fp1 = open("output.txt","w+")
for k, v in d.items():
    fp1.write(str(k) + ':' + str(v) + "\n")
fp1.close()
from collections import OrderedDict
a = [1,1,1,1,2,2,2,2,3,3,4,5,5]
def get_count(lists):
    dictionary = OrderedDict()
    for val in lists:
        dictionary.setdefault(val, []).append(1)
    return [sum(val) for val in dictionary.values()]
print(get_count(a))
>>>[4, 4, 2, 1, 2]
To remove duplicates and maintain order:
list(dict.fromkeys(get_count(a)))
>>>[4, 2, 1]
I'm using Counter to generate a frequency dict from the words of a text file in one line of code:
import re
from collections import Counter

def _fileIndex(fh):
    ''' create a dict using Counter of a
        flat list of words (re.findall(re.compile(r"[a-zA-Z]+"), lines)) in (lines in file->for lines in fh)
    '''
    return Counter(
        [wrd.lower() for wrdList in
            [words for words in
                [re.findall(re.compile(r'[a-zA-Z]+'), lines) for lines in fh]]
         for wrd in wrdList])
For the record, a functional answer:
>>> L = [1,1,1,1,2,2,2,2,3,3,4,5,5]
>>> import functools
>>> functools.reduce(lambda acc, e: [v+(i==e) for i, v in enumerate(acc,1)] if e<=len(acc) else acc+[0 for _ in range(e-len(acc)-1)]+[1], L, [])
[4, 4, 2, 1, 2]
It's cleaner if you count zeroes too:
>>> functools.reduce(lambda acc, e: [v+(i==e) for i, v in enumerate(acc)] if e<len(acc) else acc+[0 for _ in range(e-len(acc))]+[1], L, [])
[0, 4, 4, 2, 1, 2]
An explanation:
we start with an empty acc list;
if the next element e of L is lower than the size of acc, we just update this element: v+(i==e) means v+1 if the index i of acc is the current element e, otherwise the previous value v;
if the next element e of L is greater than or equal to the size of acc, we have to expand acc to host the new 1.
Unlike with itertools.groupby, the elements do not have to be sorted. You'll get weird results if you have negative numbers.
Another way of doing this, albeit with a heavier but powerful library: NLTK.
import nltk
fdist = nltk.FreqDist(a)
fdist.values()
fdist.most_common()
Found another way of doing this, using sets.
#ar is the list of elements
#convert ar to set to get unique elements
sock_set = set(ar)
#create dictionary of frequency of socks
sock_dict = {}
for sock in sock_set:
    sock_dict[sock] = ar.count(sock)
For an unordered list you should use:
[a.count(el) for el in set(a)]
The output is
[4, 4, 2, 1, 2]
Note that this is O(n²), since count() rescans the whole list for each unique element.
Yet another solution, with a different algorithm and without using collections:

def countFreq(A):
    count = [0] * (max(A) + 1)      # one slot per value; assumes non-negative integers
    for x in A:
        count[x] += 1               # increase occurrence for value x
    return [x for x in count if x]  # return non-zero counts
num=[3,2,3,5,5,3,7,6,4,6,7,2]
print ('\nelements are:\t',num)
count_dict={}
for elements in num:
    count_dict[elements] = num.count(elements)
print ('\nfrequency:\t',count_dict)
You can use the built-in count() function provided by Python:

d = []
for i in range(len(l)):
    if l[i] not in d:
        d.append(l[i])
        print(l.count(l[i]))

The above code automatically builds a list without duplicates and prints the frequency of each element of the original list.
Two birds with one stone!
This approach can be tried if you don't want to use any library and want to keep it simple and short:
a = [1,1,1,1,2,2,2,2,3,3,4,5,5]
marked = []
b = [(a.count(i), marked.append(i))[0] for i in a if i not in marked]
print(b)
Output:
[4, 4, 2, 1, 2]

Various list concatenation methods and their performance

I was working on an algorithm, and we are trying to write every line of the code so that it adds good performance to the final result.
In one situation we have to join lists (more than two, specifically). I know some of the ways to join more than two lists, and I have also looked on StackOverflow, but none of the answers accounts for the performance of the methods.
Can anyone show what ways there are to join more than two lists, and their respective performance?
Edit: The size of each list varies from 2 to 13 (to be specific).
Edit (duplicate): I have been specifically asking for the ways lists can be joined and their respective performance; the duplicate question is limited to only 4 methods.
There are multiple ways to join more than two lists.
Assuming that we have three lists,
a = ['1']
b = ['2']
c = ['3']
Then, for joining two or more lists in Python:
1)
You can simply concatenate them,
output = a + b + c
2)
You can do it using list comprehension as well,
res_list = [y for x in [a,b,c] for y in x]
3)
You can do it using extend() as well,
a.extend(b)
a.extend(c)
print(a)
4)
You can do it by using * operator as well,
res = [*a,*b,*c]
For calculating performance, I have used the timeit module present in Python.
The performance of the methods, ordered by running time, is:
4th method < 1st method < 3rd method < 2nd method
That means that if you use the * operator for concatenating more than two lists, you will get the best performance.
Hope you got what you were looking for.
Edit: [image showing the performance of all the methods, calculated using timeit]
I did some simple measurements, here are my results:
import timeit
from itertools import chain
a = [*range(1, 10)]
b = [*range(1, 10)]
c = [*range(1, 10)]
tests = ("""output = list(chain(a, b, c))""",
"""output = a + b + c""",
"""output = [*chain(a, b, c)]""",
"""output = a.copy();output.extend(b);output.extend(c);""",
"""output = [*a, *b, *c]""",
"""output = a.copy();output+=b;output+=c;""",
"""output = a.copy();output+=[*b, *c]""",
"""output = a.copy();output += b + c""")
results = sorted((timeit.timeit(stmt=test, number=1, globals=globals()), test) for test in tests)
for i, (t, stmt) in enumerate(results, 1):
    print(f'{i}.\t{t}\t{stmt}')
Prints on my machine (AMD 2400G, Python 3.6.7):
1. 6.010000106471125e-07 output = [*a, *b, *c]
2. 7.109999842214165e-07 output = a.copy();output += b + c
3. 7.720000212430023e-07 output = a.copy();output+=b;output+=c;
4. 7.820001428626711e-07 output = a + b + c
5. 1.0520000159885967e-06 output = a.copy();output+=[*b, *c]
6. 1.4030001693754457e-06 output = a.copy();output.extend(b);output.extend(c);
7. 1.4820000160398195e-06 output = [*chain(a, b, c)]
8. 2.525000127207022e-06 output = list(chain(a, b, c))
If you are going to concatenate a variable number of lists, your input is going to be a list of lists (or some equivalent collection). The performance tests need to take this into account, because you won't be able to do things like list1 + list2 + list3.
Here are some test results (1000 repetitions):
option1 += loop 0.00097 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4]
option2 itertools.chain 0.00138 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4]
option3 functools.reduce 0.00174 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4]
option4 comprehension 0.00188 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4]
option5 extend loop 0.00127 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4]
option6 deque 0.00180 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4]
This would indicate that a += loop through the list of lists is the fastest approach.
And the source to produce them:
allLists = [list(range(10)) for _ in range(5)]

def option1():
    result = allLists[0].copy()
    for lst in allLists[1:]:
        result += lst
    return result

from itertools import chain
def option2():
    return list(chain(*allLists))

from functools import reduce
def option3():
    return list(reduce(lambda a, b: a + b, allLists))

def option4():
    return [e for l in allLists for e in l]

def option5():
    result = allLists[0].copy()
    for lst in allLists[1:]:
        result.extend(lst)
    return result

from collections import deque
def option6():
    result = deque()
    for lst in allLists:
        result.extend(lst)
    return list(result)
from timeit import timeit
count = 1000
t = timeit(lambda:option1(), number = count)
print(f"option1 += loop {t:.5f}",option1()[:15])
t = timeit(lambda:option2(), number = count)
print(f"option2 itertools.chain {t:.5f}",option2()[:15])
t = timeit(lambda:option3(), number = count)
print(f"option3 functools.reduce {t:.5f}",option3()[:15])
t = timeit(lambda:option4(), number = count)
print(f"option4 comprehension {t:.5f}",option4()[:15])
t = timeit(lambda:option5(), number = count)
print(f"option5 extend loop {t:.5f}",option5()[:15])
t = timeit(lambda:option6(), number = count)
print(f"option6 deque {t:.5f}",option6()[:15])
