Repeat values in a column based on another

Suppose I have an array (not necessarily square)
my_array = ['name_1', 3
'name_2', 2]
and I want to end up with a list (or NumPy array, etc.) of length 3+2=5, where the first three positions hold 'name_1' and the next two hold 'name_2'. The output should be
['name_1', 'name_1', 'name_1', 'name_2', 'name_2']
This is what I have done so far. Is there a better way to do this?
import numpy as np
my_array = np.array([['name_1', 3], ['name_2', 2]])
l = []
for i in range(my_array.shape[0]):
    # np.int was removed in NumPy 1.24; the built-in int() does the same job
    x = [my_array[i, 0].tolist()] * int(my_array[i, 1])
    l.append(x)
flat_list = [item for sublist in l for item in sublist]
print(flat_list)
which prints:
['name_1', 'name_1', 'name_1', 'name_2', 'name_2']
Thanks!

Use np.repeat:
my_array[:,0].repeat(my_array[:,1].astype(int))
# array(['name_1', 'name_1', 'name_1', 'name_2', 'name_2'], dtype='<U6')

You can use a combination of list multiplication and sum, assuming my_array is a flat list such as ['name_1', 3, 'name_2', 2]:
sum(([my_array[i]] * my_array[i+1] for i in range(0, len(my_array), 2)), [])

I'm not a numpy expert, and "better" way is subjective. Here's one way using itertools
from itertools import chain, repeat
chain.from_iterable(repeat(elem, count) for elem, count in zip(my_array[::2], my_array[1::2]))
Here's a breakdown of how it works.
my_array[::2] takes every other element: because the start and stop arguments are left empty, it starts at 0 and runs to the end. That gives you the first column, which holds your input elements. The counts are in the other column, so my_array[1::2] picks them up. On a NumPy array these slices are views that skip every other element from some offset, so no data is copied; on a plain Python list each slice is a new (shallow) list.
Now we want to walk through those in pairs, and zip() is handy for that. It consumes iterators, generators, or sequences in parallel and yields one tuple per step. As we zip through in the for construct, each element is bound to elem and each count to count.
The for ... in construct lets us transform each pair. Here we use repeat to build virtual repetitions of each element. Again, no new arrays are created: the repeat iterator just produces the input element N times.
Finally, we want to string all of these repeated elements into one flattened sequence. That's where chain.from_iterable() comes in. It consumes an iterable of iterables, unrolling each one in turn. Like the other pieces, chain produces a lazy iterator, not a new list, so we again save on memory. If you do want a list, feed it to list() at the end, or just loop over it with for ... in.
Here's all of that broken out into individual operations with intention-revealing-variables:
elements = my_array[::2]
counts = my_array[1::2]
bypairs = zip(elements, counts)
repeated = (repeat(elem, count) for elem, count in bypairs)
flattened = chain.from_iterable(repeated)
list(flattened)
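The same chain/repeat idea can be sketched for the two-column layout from the question, where each row is a [name, count] pair and not a flat list. The names rows and expanded_rows are made up for this illustration, and the int() call is an assumption to cover counts stored as strings, which is what a mixed-type NumPy array holds.

```python
from itertools import chain, repeat

# Hypothetical input: one [name, count] row per entry, as in the question.
rows = [['name_1', 3], ['name_2', 2]]

# repeat() lazily yields each name `count` times; chain.from_iterable()
# flattens those per-row iterators into one stream.
# int() guards against counts that arrive as strings.
expanded_rows = list(
    chain.from_iterable(repeat(name, int(count)) for name, count in rows)
)
print(expanded_rows)  # ['name_1', 'name_1', 'name_1', 'name_2', 'name_2']
```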

Here's another approach, skipping itertools, favoring generators:
def expanded(matrix):
    stream = iter(matrix)
    for element, count in zip(stream, stream):
        for _ in range(count):
            yield element
list(expanded(my_array))

Using list comprehension:
In [3]: my_array = ['name_1', 3, 'name_2', 2]
In [4]: out = [i for i, j in zip(my_array[::2], my_array[1::2]) for _ in range(j)]
In [5]: out
Out[5]: ['name_1', 'name_1', 'name_1', 'name_2', 'name_2']

Related

How to take the first N items from a generator or list? [duplicate]

With linq I would
var top5 = array.Take(5);
How to do this with Python?

Slicing a list
top5 = array[:5]
To slice a list, there's a simple syntax: array[start:stop:step]
You can omit any parameter. These are all valid: array[start:], array[:stop], array[::step]
Slicing a generator
import itertools
top5 = itertools.islice(my_list, 5) # grab the first five elements
You can't slice a generator directly in Python. itertools.islice() will wrap an object in a new slicing generator using the syntax itertools.islice(generator, start, stop, step)
Remember, slicing a generator will exhaust it partially. If you want to keep the entire generator intact, perhaps turn it into a tuple or list first, like: result = tuple(generator)
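A small sketch of the "partial exhaustion" point, using a made-up generator function numbers() purely for illustration: after islice takes five items, the underlying generator continues from the sixth.

```python
from itertools import islice

def numbers():
    """Hypothetical generator used only for this illustration."""
    for i in range(10):
        yield i

gen = numbers()
top5 = list(islice(gen, 5))  # consumes the first five items
rest = list(gen)             # the generator resumes where islice stopped
print(top5)  # [0, 1, 2, 3, 4]
print(rest)  # [5, 6, 7, 8, 9]
```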

import itertools
top5 = itertools.islice(array, 5)

@Shaikovsky's answer is excellent, but I wanted to clarify a couple of points.
[next(generator) for _ in range(n)]
This is the simplest approach, but it raises StopIteration if the generator is exhausted before n items have been taken.
The following approaches, on the other hand, return up to n items, which is preferable in many circumstances:
List:
[x for _, x in zip(range(n), records)]
Generator:
(x for _, x in zip(range(n), records))
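The difference between the two styles can be sketched as follows. The helper name take is hypothetical; the point is that the zip form quietly returns fewer items, while the next() form raises.

```python
def take(n, iterable):
    """Return up to n items from iterable (hypothetical helper)."""
    # range(n) comes first, so zip stops before pulling an extra item.
    return [x for _, x in zip(range(n), iterable)]

records = iter(['a', 'b', 'c'])
first_two = take(2, records)  # ['a', 'b']
leftover = take(5, records)   # ['c']: fewer than 5 remain, no exception

# The next()-based list comprehension raises when the iterator runs dry.
empty = iter([])
try:
    [next(empty) for _ in range(1)]
    raised = False
except StopIteration:
    raised = True
```

Putting range(n) as the first argument to zip matters: with the arguments swapped, zip would pull one extra item from the iterator before noticing that range is exhausted.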

To my taste, it is also very concise to combine zip() with range(n) (xrange(n) in Python 2). This works well on generators too and is easy to adapt.
# Option #1: take the first n elements as a list
[x for _, x in zip(range(n), generator)]
# Option #2: use next(); raises StopIteration if the generator runs out early
[next(generator) for _ in range(n)]
# Option #3: take the first n elements as a new generator
(x for _, x in zip(range(n), generator))
# Option #4: wrap it in a generator function
# (since Python 3.7 a StopIteration escaping next() here surfaces as RuntimeError)
def top_n(n, generator):
    for _ in range(n):
        yield next(generator)

The answer for how to do this can be found here
>>> generator = (i for i in xrange(10))
>>> list(next(generator) for _ in range(4))
[0, 1, 2, 3]
>>> list(next(generator) for _ in range(4))
[4, 5, 6, 7]
>>> list(next(generator) for _ in range(4))
[8, 9]
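Notice that the last call asks for the next 4 when only 2 remain. Using list() around a generator expression instead of [] is what lets the expression terminate on the StopIteration raised by next(). This relies on Python 2 behavior (the example uses xrange): since Python 3.7 (PEP 479), a StopIteration raised inside a generator expression is converted to RuntimeError, so the last call fails there.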
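For Python 3, one way to get the same "next 4, or whatever is left" behavior without depending on how StopIteration propagates is itertools.islice. This is a sketch of an alternative, not the technique from the answer above.

```python
from itertools import islice

generator = (i for i in range(10))

# Each islice call takes up to 4 items and simply stops early
# when the generator has fewer left.
chunks = [list(islice(generator, 4)) for _ in range(3)]
print(chunks)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```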

Do you mean the first N items, or the N largest items?
If you want the first:
top5 = sequence[:5]
This also works for the largest N items, assuming that your sequence is sorted in descending order. (Your LINQ example seems to assume this as well.)
If you want the largest, and it isn't sorted, the most obvious solution is to sort it first:
l = list(sequence)
l.sort(reverse=True)
top5 = l[:5]
For a more performant solution, use a min-heap (thanks Thijs):
import heapq
top5 = heapq.nlargest(5, sequence)
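A quick check, on made-up sample data, that heapq.nlargest and the sort-then-slice approach agree:

```python
import heapq

sequence = [7, 2, 9, 4, 11, 5, 1]  # sample data for illustration

top3 = heapq.nlargest(3, sequence)            # does not sort the whole input
by_sort = sorted(sequence, reverse=True)[:3]  # sorts everything, then slices
print(top3)     # [11, 9, 7]
print(by_sort)  # [11, 9, 7]
```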

With itertools you get another iterator object, so in most cases you need one more step to actually take the first n elements. There are at least two simpler solutions (a little less efficient, but very handy) that give you the elements ready to use:
Using list comprehension:
first_n_elements = [next(generator) for i in range(n)]
Otherwise:
first_n_elements = list(generator)[:n]
Where n is the number of elements you want to take (e.g. n=5 for the first five elements). Note that the first form raises StopIteration if the generator holds fewer than n items, and the second consumes the whole generator, so it never finishes on an infinite one.

This should work
top5 = array[:5]

Indexing a list with an unique index

I have a list say l = [10,10,20,15,10,20]. I want to assign each unique value a certain "index" to get [1,1,2,3,1,2].
This is my code:
a = list(set(l))
res = [a.index(x) for x in l]
Which turns out to be very slow.
l has 1M elements, and 100K unique elements. I have also tried map with lambda and sorting, which did not help. What is the ideal way to do this?

You can do this in O(N) time using a defaultdict and a list comprehension:
>>> from itertools import count
>>> from collections import defaultdict
>>> lst = [10, 10, 20, 15, 10, 20]
>>> d = defaultdict(count(1).next)
>>> [d[k] for k in lst]
[1, 1, 2, 3, 1, 2]
In Python 3 use __next__ instead of next.
How does it work?
The default_factory (count(1).next in this case) passed to defaultdict is called only when a missing key is looked up. For the first 10 the key is missing, so the factory is called and the value becomes 1. The next 10 is no longer missing, so the stored 1 is reused. 20 is missing again, so the factory is called again and returns 2, and so on.
d at the end will look like this:
>>> d
defaultdict(<method-wrapper 'next' of itertools.count object at 0x1057c83b0>,
{10: 1, 20: 2, 15: 3})
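For reference, a Python 3 version of the same idea might look like this; the only change is the bound method name, __next__ instead of next.

```python
from collections import defaultdict
from itertools import count

lst = [10, 10, 20, 15, 10, 20]

# count(1).__next__ returns 1, 2, 3, ... on successive calls;
# defaultdict calls it only the first time a key is seen.
d = defaultdict(count(1).__next__)
res = [d[k] for k in lst]
print(res)      # [1, 1, 2, 3, 1, 2]
print(dict(d))  # {10: 1, 20: 2, 15: 3}
```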

The slowness of your code arises because a.index(x) performs a linear search and you perform that linear search for each of the elements in l. So for each of the 1M items you perform (up to) 100K comparisons.
The fastest way to transform one value to another is looking it up in a map. You'll need to create the map and fill in the relationship between the original values and the values you want. Then retrieve the value from the map when you encounter another of the same value in your list.
Here is an example that makes a single pass through l. There may be room for further optimization to eliminate the need to repeatedly reallocate res when appending to it.
res = []
conversion = {}
i = 0
for x in l:
    if x not in conversion:
        value = conversion[x] = i
        i += 1
    else:
        value = conversion[x]
    res.append(value)

Well I guess it depends on if you want it to return the indexes in that specific order or not. If you want the example to return:
[1,1,2,3,1,2]
then you can look at the other answers submitted. However if you only care about getting a unique index for each unique number then I have a fast solution for you
import numpy as np
l = [10,10,20,15,10,20]
a = np.array(l)
x,y = np.unique(a,return_inverse = True)
and for this example the output of y is:
y = [0,0,2,1,0,2]
I tested this for 1,000,000 entries and it was done essentially immediately.

Your solution is slow because its complexity is O(nm) with m being the number of unique elements in l: a.index() is O(m) and you call it for every element in l.
To make it O(n), get rid of index() and store indexes in a dictionary:
>>> idx, indexes = 1, {}
>>> for x in l:
...     if x not in indexes:
...         indexes[x] = idx
...         idx += 1
...
>>> [indexes[x] for x in l]
[1, 1, 2, 3, 1, 2]
If l contains only integers in a known range, you could also store indexes in a list instead of a dictionary for faster lookups.
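The list-based variant mentioned above could be sketched like this. It assumes the values are non-negative integers with a known upper bound (max_value is a made-up name for that bound), and it uses 0 as the "not seen yet" marker, which works because the assigned indexes start at 1.

```python
l = [10, 10, 20, 15, 10, 20]
max_value = 20  # assumed known upper bound on the values in l

indexes = [0] * (max_value + 1)  # slot v holds the index for value v; 0 = unseen
next_idx = 1
res = []
for x in l:
    if indexes[x] == 0:
        indexes[x] = next_idx
        next_idx += 1
    res.append(indexes[x])
print(res)  # [1, 1, 2, 3, 1, 2]
```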

You can use collections.OrderedDict to keep the unique items in their original order, then enumerate them to build a dict that maps each item to its index. Build an operator.itemgetter() from the main list and call it on that dict to get the index for each item:
>>> from collections import OrderedDict
>>> from operator import itemgetter
>>> itemgetter(*lst)({j:i for i,j in enumerate(OrderedDict.fromkeys(lst),1)})
(1, 1, 2, 3, 1, 2)

For completeness, you can also do it eagerly:
from itertools import count
wordid = dict(zip(set(list_), count(1)))
This uses a set to obtain the unique words in list_, pairs each of those unique words with the next value from count() (which counts upwards), and constructs a dictionary from the results. Because a set is unordered, the numbering does not follow the order of first appearance.
Original answer, written by nneonneo.

What is the simplest and most efficient function to return a sublist based on an index list?

Say I have a list l:
['a','b','c','d','e']
and a list of indexes idx:
[1,3]
What is the simplest and most efficient function that will return:
['b','d']

Try using this:
[l[i] for i in idx]

You want operator.itemgetter.
In my first example, I'll show how you can use itemgetter to construct a callable which you can use on any indexable object:
from operator import itemgetter
items = itemgetter(1,3)
items(yourlist) #('b', 'd')
Now I'll show how you can use argument unpacking to store your indices as a list
from operator import itemgetter
a = ['a','b','c','d','e']
idx = [1,3]
items = itemgetter(*idx)
print(items(a)) #('b', 'd')
Of course, this gives you a tuple, not a list, but it's trivial to construct a list from a tuple if you really need to.

Here is an option using a list comprehension:
[v for i, v in enumerate(l) if i in idx]
This will be more efficient if you convert idx to a set first.
An alternative with operator.itemgetter:
import operator
operator.itemgetter(*idx)(l)
As noted in comments, [l[i] for i in idx] will probably be your best bet here, unless idx may contain indices greater than the length of l, or if idx is not ordered and you want to keep the same order as l.
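The ordering difference between these options is easiest to see with an unordered idx. The sketch below uses idx = [3, 1] (not the [1, 3] from the question) to make the difference visible.

```python
from operator import itemgetter

l = ['a', 'b', 'c', 'd', 'e']
idx = [3, 1]  # deliberately out of order

# Follows the order of idx.
by_index_order = [l[i] for i in idx]  # ['d', 'b']

# Follows the order of l; the set makes each membership test O(1).
idx_set = set(idx)
by_list_order = [v for i, v in enumerate(l) if i in idx_set]  # ['b', 'd']

# Follows the order of idx, but returns a tuple.
as_tuple = itemgetter(*idx)(l)  # ('d', 'b')
```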

python: how to know the index when you randomly select an element from a sequence with random.choice(seq)

I know very well how to select a random item from a list with random.choice(seq) but how do I know the index of that element?

import random
l = ['a','b','c','d','e']
i = random.choice(range(len(l)))
print(i, l[i])

You could first choose a random index, then get the list element at that location to have both the index and value.
>>> import random
>>> a = [1, 2, 3, 4, 5]
>>> index = random.randint(0,len(a)-1)
>>> index
0
>>> a[index]
1

You can do it using randrange function from random module
import random
l = ['a','b','c','d','e']
i = random.randrange(len(l))
print(i, l[i])

The most elegant way to do so is random.randrange:
index = random.randrange(len(MY_LIST))
value = MY_LIST[index]
One can also do this in python3, less elegantly (but still better than .index) with random.choice on a range object:
index = random.choice(range(len(MY_LIST)))
value = MY_LIST[index]
The only valid solutions are this solution and the random.randint solutions.
The ones which use list.index not only are slow (O(N) per lookup rather than O(1); gets really bad if you do this for each element, you'll have to do O(N^2) comparisons) but ALSO you will have skewed/incorrect results if the list elements are not unique.
One would think that this is slow, but it turns out to be only slightly slower than the other correct solution, random.randint, and may be more readable. I personally find it more elegant because there is no index arithmetic and no extra parameters, as there are with randint(0, len(...)-1). Some may consider those parameters a feature, though one then needs to remember that randint uses an inclusive range [start, stop].
Proof of speed for random.choice: The only reason this works is that the range object is OPTIMIZED for indexing. As proof, you can do random.choice(range(10**12)); if it iterated through the entire list your machine would be slowed to a crawl.
edit: I had overlooked randrange because the docs seemed to say "don't use this function" (but actually meant "this function is pythonic, use it"). Thanks to martineau for pointing this out.
You could of course abstract this into a function:
def randomElement(sequence):
    index = random.randrange(len(sequence))
    return index, sequence[index]
i, value = randomElement(range(10**15))  # try THAT with .index, heh
# (don't, your machine will die)
# use xrange if using python2
# i, value = (268840440712786, 268840440712786)
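The point about list.index and duplicate elements can be illustrated with a short sketch. The list letters and the helper name random_element are made up for this example.

```python
import random

def random_element(sequence):
    """Pick a random position and return (index, value)."""
    index = random.randrange(len(sequence))
    return index, sequence[index]

letters = ['a', 'b', 'a', 'c']  # 'a' appears twice on purpose

index, value = random_element(letters)  # index can be 0, 1, 2 or 3

# list.index always reports the first match, so position 2 can never
# be produced by "choose a value, then look up its index".
positions_via_index = {letters.index(v) for v in letters}
print(positions_via_index)  # {0, 1, 3}
```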

If the values are unique in the sequence, you can always say: list.index(value)

Using randrange(), as has been suggested, is a great way to get the index. With a dictionary comprehension you can reduce this to one line, as shown below. Since the dictionary has only one element, popitem() returns the index and value together as a tuple.
import random
letters = "abcdefghijklmnopqrstuvwxyz"
# dictionary created via comprehension
idx, val = {i: letters[i] for i in [random.randrange(len(letters))]}.popitem()
print("index {} value {}".format(idx, val))

We can also use the sample() method.
If you want to randomly select n elements from a list:
import random
l, n = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 2
index_list = random.sample(range(len(l)), n)
index_list will have unique indexes.
I prefer sample() over choices() because sample() draws without replacement, so no index is picked twice.
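To get the values along with their indexes, the sampled index list can be paired back with the original list. A minimal sketch, with picked as a made-up name:

```python
import random

l = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
n = 3

# Sampling from range(len(l)) picks n distinct positions.
index_list = random.sample(range(len(l)), n)

# Pair each chosen index with the element stored there.
picked = [(i, l[i]) for i in index_list]
print(picked)
```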
