Shortest way to linearize a list in Python

I want to make a list with linearly increasing values from a list with non-linearly increasing values in Python. For example:
input = [10,10,10,6,6,4,1,1,1,10,10]
should be transformed to:
output = [0,0,0,1,1,2,3,3,3,0,0]
My code uses a Python dictionary:
def linearize(input):
    """
    Remap an input list containing values in non-linear-indices list
    i.e.
    input  = [10,10,10,6,6,3,1,1]
    output = [0,0,0,1,1,2,3,3]
    """
    remap = {}
    i = 0
    output = [0] * len(input)
    for x in input:
        if x not in remap.keys():
            remap[x] = i
            i = i + 1
    for i in range(0, len(input)):
        output[i] = remap[input[i]]
    return output
but I know this code can be made more efficient. Any ideas to do this task better and in a more Pythonic way? Is NumPy an option?
This function has to be called very frequently on big lists.

As per your comment in the question, you are looking for something like this:
data = [8,8,6,6,3,8]
from itertools import count
from collections import defaultdict
counter = defaultdict(lambda x=count(): next(x))
print([counter[item] for item in data])
# [0, 0, 1, 1, 2, 0]
Thanks to poke,
list(map(lambda i, c=defaultdict(lambda c=count(): next(c)): c[i], data))
It's just a one-liner now :)

Use collections.OrderedDict (here l is the input list):
In [802]: l = [10,10,10,6,6,4,1,1,1]
     ...: from collections import OrderedDict
     ...: odk = OrderedDict.fromkeys(l).keys()
     ...: odk = {k: i for i, k in enumerate(odk)}
     ...: [odk[i] for i in l]
Out[802]: [0, 0, 0, 1, 1, 2, 3, 3, 3]
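On Python 3.7+, where plain dicts preserve insertion order, the same idea works without the import (a small variation on this answer, not part of the original):
l = [10, 10, 10, 6, 6, 4, 1, 1, 1]
remap = {k: i for i, k in enumerate(dict.fromkeys(l))}
[remap[x] for x in l]
# [0, 0, 0, 1, 1, 2, 3, 3, 3]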

A simpler solution without imports:
input = [10,10,10,6,6,4,1,1,1,10,10]
d = {}
result = [d.setdefault(x, len(d)) for x in input]
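Here d.setdefault(x, len(d)) returns the existing label for x, or inserts len(d), the next unused label, when x is new:
print(result)
# [0, 0, 0, 1, 1, 2, 3, 3, 3, 0, 0]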

I came up with this function using NumPy, which in my tests worked faster than yours when the input list was very big, e.g. 2,000,000 elements:
import numpy as np

def linearize(input):
    unique, inverse = np.unique(input, return_inverse=True)
    output = (len(unique) - 1) - inverse
    return output
Also, note this function only works if your input is in descending order, like your example.
Let me know if it helps.
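If your input is not guaranteed to be descending, here is a minimal sketch of a first-occurrence variant, still built on np.unique (my addition, not part of the original answer; linearize_first_seen is a hypothetical name):
import numpy as np

def linearize_first_seen(seq):
    # Label each value by the order of its first appearance in the input.
    arr = np.asarray(seq)
    uniques, first_idx, inverse = np.unique(arr, return_index=True, return_inverse=True)
    rank = np.empty(len(uniques), dtype=int)
    rank[first_idx.argsort()] = np.arange(len(uniques))  # rank uniques by first occurrence
    return rank[inverse]

print(linearize_first_seen([10,10,10,6,6,4,1,1,1,10,10]).tolist())
# [0, 0, 0, 1, 1, 2, 3, 3, 3, 0, 0]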

Related

find unique lists inside another list in an efficient way

solution = [[1,0,0],[0,1,0], [1,0,0], [1,0,0]]
I have the above nested list, which contains some other lists inside it. How do I get the unique lists inside solution?
output = [[1,0,0],[0,1,0]]
Note: each list is of the same size.
Things I have tried:
Taking each list and comparing it with all the other lists to see whether it is duplicated, but that is very slow.
How can I check, before inserting a list, whether there is any duplicate of it, so as to avoid inserting duplicates?
If you don't care about the order, you can use set:
solution = [[1,0,0],[0,1,0],[1,0,0],[1,0,0]]
output = set(map(tuple, solution))
print(output) # {(1, 0, 0), (0, 1, 0)}
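If you do care about order and are on Python 3.7+ (where dicts preserve insertion order), the same tuple trick keeps the first-seen order:
output = [list(t) for t in dict.fromkeys(map(tuple, solution))]
print(output)  # [[1, 0, 0], [0, 1, 0]]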
Since lists are mutable objects, they aren't hashable, so you can't put them in a set directly. You can convert each list to a tuple, however, and store the tuple-ized view of each list in a set.
Tuples are heterogeneous immutable containers, unlike lists, which are mutable and idiomatically homogeneous.
from typing import List, Any

def de_dupe(lst: List[List[Any]]) -> List[List[Any]]:
    seen = set()
    output = []
    for element in lst:
        tup = tuple(element)
        if tup in seen:
            continue  # we've already added this one
        seen.add(tup)
        output.append(element)
    return output

solution = [[1,0,0],[0,1,0], [1,0,0], [1,0,0]]
assert de_dupe(solution) == [[1, 0, 0], [0, 1, 0]]
Pandas duplicated might be of help.
import pandas as pd
df=pd.DataFrame([[1,0,0],[0,1,0], [1,0,0], [1,0,0]])
d =df[~df.duplicated()].values.tolist()
Output
[[1, 0, 0], [0, 1, 0]]
Or, since you tagged multidimensional-array, you can use a NumPy approach.
import numpy as np

def unique_rows(a):
    a = np.ascontiguousarray(a)
    unique_a = np.unique(a.view([('', a.dtype)] * a.shape[1]))
    return unique_a.view(a.dtype).reshape((unique_a.shape[0], a.shape[1]))

arr = np.array([[1,0,0],[0,1,0], [1,0,0], [1,0,0]])
output = unique_rows(arr).tolist()
Based on the suggestion in this OP
Try this solution:
x = [[1,0,0],[0,1,0], [1,0,0], [1,0,0]]
Import NumPy and convert the nested list into a NumPy array:
import numpy as np
a1 = np.array(x)
Find the unique rows:
a2 = np.unique(a1, axis=0)
Convert it back to a nested list:
a2.tolist()
Hope this helps.
While lists are not hashable, and therefore inefficient to deduplicate, tuples are. So one way would be to transform your list into tuples and deduplicate those.
>>> solution_tuples = [(1,0,0), (0,1,0), (1,0,0), (1,0,0)]
>>> set(solution_tuples)
{(1, 0, 0), (0, 1, 0)}

In Python, is there an efficient way of separating an array with elements mapped to another array?

Let's say I have an arbitrary array np.array([1,2,3,4,5,6]) and another array that maps specific elements in the first array to a group, np.array(['a','b', 'a','c','c', 'b']). Now I want to separate them into three different arrays depending on the label/group given in the second array, so that they are a, b, c = np.array([1,3]), np.array([2,6]), np.array([4,5]). Is a simple for loop the way to go, or is there some efficient method I'm missing here?
When you write efficient, I assume that what you want here is actually fast.
I will try to discuss briefly asymptotic efficiency.
In this context, we refer to N as the input size and K as the number of unique values.
My solution would be to use a combination of np.argsort() and a custom-built groupby_np() specifically optimized for NumPy inputs:
import numpy as np

def groupby_np(arr, both=True):
    n = len(arr)
    extrema = np.nonzero(arr[:-1] != arr[1:])[0] + 1
    if both:
        last_i = 0
        for i in extrema:
            yield last_i, i
            last_i = i
        yield last_i, n
    else:
        yield 0
        yield from extrema
        yield n

def labeling_groupby_np(values, labels):
    slicing = labels.argsort()
    sorted_labels = labels[slicing]
    sorted_values = values[slicing]
    del slicing
    result = {}
    for i, j in groupby_np(sorted_labels, True):
        result[sorted_labels[i]] = sorted_values[i:j]
    return result
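For example, a quick check on the question's input (keys come back as NumPy string scalars):
values = np.array([1, 2, 3, 4, 5, 6])
labels = np.array(['a', 'b', 'a', 'c', 'c', 'b'])
print(labeling_groupby_np(values, labels))
# {'a': array([1, 3]), 'b': array([2, 6]), 'c': array([4, 5])}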
This has complexity O(N log N + K).
The N log N comes from the sorting step and the K comes from the last loop.
The interesting part is that both the N-dependent and the K-dependent steps are fast: the N-dependent part is executed at low level, and each of the K iterations of the final loop is O(1).
A solution like the following (very similar to #theEpsilon's answer):
import numpy as np

def labeling_loop(values, labels):
    labeled = {}
    for x, l in zip(values, labels):
        if l not in labeled:
            labeled[l] = [x]
        else:
            labeled[l].append(x)
    return {k: np.array(v) for k, v in labeled.items()}
uses two loops and has O(N + K). I do not think you can easily avoid the second loop (without a significant speed penalty). As for the first loop, this is executed in Python, which carries a significant speed penalty on its own.
Another possibility is to use np.unique(), which brings the main loop to a lower level. However, this brings other challenges, because once the unique values are extracted, there is no efficient way of extracting the information needed to construct the arrays you want without some NumPy advanced indexing, which is O(N) per label. The overall complexity of these solutions is then O(K * N), but because the NumPy advanced indexing is done at a lower level, this can lead to relatively fast solutions, although with worse asymptotic complexity than the alternatives.
Possible implementations include (similar to #AjayVerma's and #AKX's answers):
import numpy as np

def labeling_unique_bool(values, labels):
    return {l: values[l == labels] for l in np.unique(labels)}

def labeling_unique_nonzero(values, labels):
    return {l: values[np.nonzero(l == labels)] for l in np.unique(labels)}
Additionally, one could consider a pre-sorting step to then speed up the slicing part by avoiding NumPy advanced indexing.
However, the sorting step can be more costly than the advanced indexing, and in general the proposed approach tends to be faster for the input I tested.
import numpy as np

def labeling_unique_argsort(values, labels):
    uniques, counts = np.unique(labels, return_counts=True)
    sorted_values = values[labels.argsort()]
    bound = 0
    result = {}
    for x, c in zip(uniques, counts):
        result[x] = sorted_values[bound:bound + c]
        bound += c
    return result
Another approach, which is neat in principle (same idea as my proposed approach) but slow in practice, would be to use sorting and itertools.groupby():
import itertools
from operator import itemgetter

import numpy as np

def labeling_groupby(values, labels):
    slicing = labels.argsort()
    sorted_labels = labels[slicing]
    sorted_values = values[slicing]
    del slicing
    result = {}
    for x, g in itertools.groupby(zip(sorted_labels, sorted_values), itemgetter(0)):
        result[x] = np.fromiter(map(itemgetter(1), g), dtype=sorted_values.dtype)
    return result
Finally, a Pandas-based approach, which is quite concise and reasonably fast for larger inputs but under-performing for smaller ones (similar to #Ehsan's answer):
import pandas as pd

def labeling_groupby_pd(values, labels):
    df = pd.DataFrame({'values': values, 'labels': labels})
    # bracket access: the column name 'values' collides with the .values attribute
    return df.groupby('labels')['values'].apply(lambda x: x.values).to_dict()
Now, talking is cheap, so let us attach some numbers to fast and slow and produce some plots for varying input sizes. The value of K is capped at 52 (the lower- and upper-case letters of the English alphabet). When N is much larger than K, the probability of reaching the cap is very high.
Input is generated programmatically with the following:
import string

import numpy as np

def gen_input(n, p, labels=string.ascii_letters):
    k = len(labels)
    values = np.arange(n)
    labels = np.array([labels[i] for i in np.random.randint(0, int(k * p), n)])
    return values, labels
and the benchmarks are produced for values of p from (1.0, 0.5, 0.1, 0.05), which change the maximum value of K. The plots below refer to the p values in that order.
[Benchmark plots omitted: p = 1.0 (at most K = 52), p = 0.5 (at most K = 26), p = 0.1 (at most K = 5), and p = 0.05 (at most K = 2), the first and last also zoomed in on the fastest methods.]
One can see how the proposed method, except for very small inputs, outperforms the other methods proposed so far for the tested inputs.
(Full benchmarks available here).
One may also consider moving some parts of the looping to Numba / Cython, but I'd leave this to the interested reader.
You can use numpy.unique:
x = np.array([1,2,3,4,5,6])
y = np.array(['a','b', 'a','c','c', 'b'])
print({value:x[y==value] for value in np.unique(y)})
Output
{'a': array([1, 3]), 'b': array([2, 6]), 'c': array([4, 5])}
This is a textbook use of pandas groupby:
import pandas as pd
df = pd.DataFrame({'A':[1,2,3,4,5,6],'B':['a','b','a','c','c','b']})
a,b,c = df.groupby('B').A.apply(lambda x:x.values)
#[1 3], [2 6], [4 5]
I'm sure there's some simple invocation to do it all in one fell swoop, and a NumPy guru will soon enlighten us, but:
import numpy as np

indices = np.array([1,2,3,4,5,6])
values = np.array(['a', 'b', 'a', 'c', 'c', 'b'])
indices_by_value = {}
for value in np.unique(values):
    indices_by_value[value] = indices[values == value]
will leave you with
{'a': array([1, 3]), 'b': array([2, 6]), 'c': array([4, 5])}
You can do something like this:
from collections import defaultdict

d = defaultdict(list)
letters = ['a', 'b', 'a', 'c', 'c', 'b']
numbers = [1, 2, 3, 4, 5, 6]
for l, n in zip(letters, numbers):
    d[l].append(n)
And d will have your answer.
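If you need NumPy arrays back (as in the question), one extra comprehension does it:
import numpy as np
arrays = {k: np.array(v) for k, v in d.items()}
# {'a': array([1, 3]), 'b': array([2, 6]), 'c': array([4, 5])}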
Using the mask selection feature of NumPy should do the job. Something like this:
>>> import numpy as np
>>> xx = np.array(range(5))
>>> yy = np.array(['a','b','a','d','e'])
>>> yy == 'a'
array([ True, False,  True, False, False])
>>> xx[yy == 'a']
array([0, 2])
Consider browsing the unique values of your array of labels and building a dictionary of matches incrementally.
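A minimal sketch of that last suggestion, reusing xx and yy from above:
matches = {label: xx[yy == label] for label in np.unique(yy)}
# {'a': array([0, 2]), 'b': array([1]), 'd': array([3]), 'e': array([4])}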

List of strings to array of integers

From a list of strings, like this one:
example_list = ['010','101']
I need to get an array of integers, where each row corresponds to one of the strings, with each character in its own column, like this one:
example_array = np.array([[0,1,0],[1,0,1]])
I have tried with this code, but it isn't working:
import numpy as np

example_array = np.empty([2,3], dtype=int)
i = 0; j = 0
for string in example_list:
    for bit in string:
        example_array[i,j] = int(bit)
        j += 1
    i += 1
# note: j is never reset to 0 between rows, so the second row
# indexes past the last column instead of starting at column 0
Can anyone help me? I am using Python 3.6.
Thank you in advance for your help!
If all strings are the same length (this is crucial to building a contiguous array), then use view to efficiently separate the characters.
r = np.array(example_list)
r = r.view('<U1').reshape(*r.shape, -1).astype(int)
print(r)
[[0 1 0]
 [1 0 1]]
You could also go the list comprehension route.
r = np.array([[*map(int, list(l))] for l in example_list])
print(r)
[[0 1 0]
 [1 0 1]]
The simplest way is to use a list comprehension because it automatically generates the output list for you, which can be easily converted to a numpy array. You could do this using multiple for loops, but then you are stuck creating your list, sub lists, and appending to them. While not difficult, the code looks more elegant with list comprehensions.
Try this:
newList = np.array([[int(b) for b in a] for a in example_list])
newList now looks like this:
>>> newList
array([[0, 1, 0],
       [1, 0, 1]])
Note: there is no need to invoke map at this point, though that certainly works.
So what is going on here? We are iterating through your original list of strings (example_list) item-by-item, then iterating through each character within the current item. Functionally, this is equivalent to...
newList = []
for a in example_list:
    tmpList = []
    for b in a:
        tmpList.append(int(b))
    newList.append(tmpList)
newList = np.array(newList)
Personally, I find the multiple for loops to be easier to understand for beginners. However, once you grasp the list comprehensions you probably won't want to go back.
You could do this with map (in Python 3, map is lazy, so wrap each level in list() to materialize the result):
example_array = list(map(lambda x: list(map(lambda y: int(y), list(x))), example_list))
The outer lambda performs a list(x) operation on each item in example_list. For example, '010' => ['0','1','0']. The inner lambda converts the individual characters (resultants from list(x)) to integers. For example, ['0','1','0'] => [0,1,0].
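Since the question ultimately wants a NumPy array, a final conversion (my addition, for completeness):
import numpy as np
example_array = np.array(list(map(lambda x: list(map(int, x)), example_list)))
# array([[0, 1, 0],
#        [1, 0, 1]])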

Python: How to index the elements of a numpy array?

I'm looking for a function that would do what the function indices does in the following hypothetical code:
indices( numpy.array([[1, 2, 3], [2, 3, 4]]) )
{1: [(0,0)], 2: [(0,1),(1,0)], 3: [(0,2),(1,1)], 4: [(1,2)]}
Specifically, I want to produce a dictionary whose keys are the unique elements in the flattened array and whose values are lists of the full indices of the respective key.
I've looked at the where function, but it does not seem to provide an efficient way to solve this for large arrays. What's the best way to do this?
Notes: I'm using Python 2.7
Given that your desired output is a dictionary, I don't think there's going to be an efficient way to do this with NumPy operations. Your best bet will probably be something like
import collections
import itertools

d = collections.defaultdict(list)
for indices in itertools.product(*map(range, a.shape)):
    d[a[indices]].append(indices)
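Checking it on the array from the question (the dict keys come back as NumPy integer scalars; they're converted to plain ints here just for display):
import numpy as np
a = np.array([[1, 2, 3], [2, 3, 4]])
# ... run the loop above ...
print({int(k): v for k, v in d.items()})
# {1: [(0, 0)], 2: [(0, 1), (1, 0)], 3: [(0, 2), (1, 1)], 4: [(1, 2)]}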
The numpy_indexed package can perform these kinds of grouping operations in an efficient and fully vectorized manner, i.e.:
import numpy_indexed as npi
a = np.array([[1, 2, 3], [2, 3, 4]])
keys, values = npi.group_by(a.reshape(-1), np.indices(a.shape).reshape(-1, a.ndim))
I don't know about numpy, but this is an example solution using plain lists:
arrs = [[1, 2, 3], [2, 3, 4]]
index_map = {}  # avoid shadowing the built-in dict
for i in range(len(arrs)):
    arr = arrs[i]
    for j in range(len(arr)):
        num = arr[j]
        indices = index_map.get(num)
        if indices is None:
            index_map[num] = [(i, j)]
        else:
            index_map[num].append((i, j))

How to replace values at specific indexes of a python list?

If I have a list:
to_modify = [5,4,3,2,1,0]
And then declare two other lists:
indexes = [0,1,3,5]
replacements = [0,0,0,0]
How can I take indexes's elements as indices into to_modify, and set the corresponding elements of to_modify to replacements? I.e., after running, to_modify should be [0,0,3,0,1,0].
Apparently, I can do this through a for loop:
for ind in to_modify:
    indexes[to_modify[ind]] = replacements[ind]
But is there other way to do this?
Could I use operator.itemgetter somehow?
The biggest problem with your code is that it's unreadable. Python code rule number one: if it's not readable, no one's gonna look at it for long enough to get any useful information out of it. Always use descriptive variable names. I almost didn't catch the bug in your code; let's see it again with good names, slow-motion replay style:
to_modify = [5,4,3,2,1,0]
indexes = [0,1,3,5]
replacements = [0,0,0,0]
for index in indexes:
    to_modify[indexes[index]] = replacements[index]
    # to_modify[indexes[index]]
    # indexes[index]
    # Yo dawg, I heard you liked indexes, so I put an index inside your indexes
    # so you can go out of bounds while you go out of bounds.
As is obvious when you use descriptive variable names, you're indexing the list of indexes with values from itself, which doesn't make sense in this case.
Also, when iterating through two lists in parallel, I like to use the zip function (or izip if you're worried about memory consumption, but I'm not one of those iteration purists). So try this instead:
for (index, replacement) in zip(indexes, replacements):
    to_modify[index] = replacement
If your problem is only working with lists of numbers then I'd say that #steabert has the answer you were looking for with that numpy stuff. However you can't use sequences or other variable-sized data types as elements of numpy arrays, so if your variable to_modify has anything like that in it, you're probably best off doing it with a for loop.
numpy has arrays that allow you to use other lists/arrays as indices:
import numpy
S = numpy.array(s)
S[a] = m
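Spelled out with the question's lists (here s, a, m stand for to_modify, indexes, replacements):
import numpy
s = [5, 4, 3, 2, 1, 0]
a = [0, 1, 3, 5]
m = [0, 0, 0, 0]
S = numpy.array(s)
S[a] = m
print(S)  # [0 0 3 0 1 0]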
Why not just:
list(map(s.__setitem__, a, m))
(In Python 3, map is lazy, so the list() wrapper is needed to actually perform the assignments; in Python 2 the bare map worked on its own.)
You can use operator.setitem.
from operator import setitem
a = [5, 4, 3, 2, 1, 0]
ell = [0, 1, 3, 5]
m = [0, 0, 0, 0]
for b, c in zip(ell, m):
    setitem(a, b, c)
>>> a
[0, 0, 3, 0, 1, 0]
Is it any more readable or efficient than your solution? I am not sure!
A little slower, but readable I think:
>>> s, l, m
([5, 4, 3, 2, 1, 0], [0, 1, 3, 5], [0, 0, 0, 0])
>>> d = dict(zip(l, m))
>>> d  # a dict is better than using two lists, I think
{0: 0, 1: 0, 3: 0, 5: 0}
>>> [d.get(i, j) for i, j in enumerate(s)]
[0, 0, 3, 0, 1, 0]
for index in a:
This will cause index to take on the values of the elements of a, so using them as indices is not what you want. In Python, we iterate over a container by actually iterating over it.
"But wait", you say, "For each of those elements of a, I need to work with the corresponding element of m. How am I supposed to do that without indices?"
Simple. We transform a and m into a list of pairs (element from a, element from m), and iterate over the pairs. Which is easy to do - just use the built-in library function zip, as follows:
for a_element, m_element in zip(a, m):
    s[a_element] = m_element
To make it work the way you were trying to do it, you would have to get a list of indices to iterate over. This is doable: we can use range(len(a)) for example. But don't do that! That's not how we do things in Python. Actually directly iterating over what you want to iterate over is a beautiful, mind-liberating idea.
what about operator.itemgetter
Not really relevant here. The purpose of operator.itemgetter is to turn the act of indexing into something, into a function-like thing (what we call "a callable"), so that it can be used as a callback (for example, a 'key' for sorting or min/max operations). If we used it here, we'd have to re-call it every time through the loop to create a new itemgetter, just so that we could immediately use it once and throw it away. In context, that's just busy-work.
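By contrast, here is the kind of place where itemgetter does earn its keep, as a sort key (a tiny illustration, not part of the original answer):
from operator import itemgetter
pairs = [(2, 'b'), (0, 'c'), (1, 'a')]
print(sorted(pairs, key=itemgetter(0)))  # [(0, 'c'), (1, 'a'), (2, 'b')]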
You can solve it using a dictionary:
to_modify = [5,4,3,2,1,0]
indexes = [0,1,3,5]
replacements = [0,0,0,0]

dic = {}
for i in range(len(indexes)):
    dic[indexes[i]] = replacements[i]
print(dic)

for i in indexes:
    to_modify[i] = dic[i]
print(to_modify)
The output will be
{0: 0, 1: 0, 3: 0, 5: 0}
[0, 0, 3, 0, 1, 0]
elif menu.lower() == "edit":
    print("Your games are: " + str(games))
    remove = input("Which one do you want to edit: ")
    add = input("What do you want to change it to: ")
    for i in range(len(games)):
        if str(games[i]) == str(remove):
            games[i] = str(add)
            break
Why not use it like this? It replaces the item directly where it was found, and you can always .sort() or .reverse() the list afterwards if needed.
