Merging a list of strings and a list of lists - python

This may be a duplicate, but I couldn't find a specific answer.
I also found one answer while composing this question, but I would like to know if there is a better option, or one that works without knowing which item is the list of strings.
My question:
la=['a', 'b', 'c']
lb=[['d','e'], ['f','g'], ['i','j']]
I would like:
[['a','d','e'], ['b','f','g'], ['c','i','j']]
I discovered the following works specifically for my example;
la=['a', 'b', 'c']
lb=[['d','e'], ['f','g'], ['i','j']]
[ [x] + y for x,y in zip(la, lb)]
[['a', 'd', 'e'], ['b', 'f', 'g'], ['c', 'i', 'j']]
It works because I wrap each string in a list before concatenating, which avoids the TypeError: cannot concatenate 'str' and 'list' objects
Is there a more elegant solution?

You can use numpy.column_stack:
>>> la=['a', 'b', 'c']
>>> lb=[['d','e'], ['f','g'], ['i','j']]
>>> import numpy as np
>>> np.column_stack((la,lb))
array([['a', 'd', 'e'],
       ['b', 'f', 'g'],
       ['c', 'i', 'j']],
      dtype='|S1')

If you want an expression I can't think of anything better than using zip as above. If you want to explicitly insert elements from la into elements of lb at their heads, I'd do
for i in range( len(la) ):
    lb[i].insert(0, la[i])
which avoids having to know what zip is or does. Maybe also first check:
if len(la) != len(lb): raise IndexError("List lengths differ")
without that it'll "work" when lb is longer than la. BTW This isn't exactly the same wrt corner cases / duck typing. Seems safer to use insert, which method should exist only for a list-like object, than "+".
Also, purely personally, I'd write the above on one line
for i in range( len(la) ): lb[i].insert(0, la[i])
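For comparison, here is a minimal sketch that combines the length check with the zip approach from the question and leaves lb untouched. The function name prepend_each is hypothetical, and the [head, *row] unpacking assumes Python 3.5 or later.

```python
def prepend_each(heads, rows):
    """Return new rows with heads[i] placed in front of rows[i]."""
    if len(heads) != len(rows):
        raise IndexError("List lengths differ")
    # [head, *row] builds a fresh list, so the input rows are not mutated
    return [[head, *row] for head, row in zip(heads, rows)]


la = ['a', 'b', 'c']
lb = [['d', 'e'], ['f', 'g'], ['i', 'j']]
merged = prepend_each(la, lb)
print(merged)  # [['a', 'd', 'e'], ['b', 'f', 'g'], ['c', 'i', 'j']]
print(lb)      # [['d', 'e'], ['f', 'g'], ['i', 'j']]
```

Unlike the insert loop, this can be called repeatedly without lb growing each time.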

Related

How to efficiently get common items from two lists that may have duplicates?

my_list = ['a', 'b', 'a', 'd', 'e', 'f']
my_list_2 = ['a', 'b', 'c']
The common items are:
c = ['a', 'b', 'a']
The code:
for e in my_list:
    if e in my_list_2:
        c.append(e)
...
If the my_list is long, this would be very inefficient. If I convert both lists into two sets, then use set's intersection() function to get the common items, I will lose the duplicates in my_list.
How to deal with this efficiently?
dict is already a hashmap, so lookups are practically as efficient as a set, so you may not need to do any extra work collecting the values - if it wasn't, you could pack the values into a set to check before checking the dict
However, a large improvement may be to make a generator for the values, rather than creating a new intermediate list, to iterate over where you actually want the values
def foo(src_dict, check_list):
    for value in check_list:
        if value in src_dict:
            yield value
With the edit, you may find you're better off packing all the inputs into a set
def foo(src_list, check_list):
    hashmap = set(src_list)
    for value in check_list:
        if value in hashmap:
            yield value
If you know a lot about the inputs, you can do better, but that's an unusual case. For example, if the lists are ordered you could bisect, and if the verifying list is huge or there are very few values to check against it, the ordering and whether you build a set at all may make a difference.
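Applied to the lists from the question, the set-based idea can be sketched as below. The function and parameter names are hypothetical, and the sketch assumes the items are hashable. Duplicates and the original order of the first list are preserved.

```python
def common_items(items, allowed):
    """Keep every item (duplicates included) that also occurs in allowed."""
    allowed_set = set(allowed)  # built once; membership tests are O(1) on average
    return [item for item in items if item in allowed_set]


my_list = ['a', 'b', 'a', 'd', 'e', 'f']
my_list_2 = ['a', 'b', 'c']
c = common_items(my_list, my_list_2)
print(c)  # ['a', 'b', 'a']
```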
I am not sure about time efficiency, but, personally speaking, list comprehension would always be more of interest to me:
[x for x in my_list if x in my_list_2]
Output
['a', 'b', 'a']
First, utilize the set.intersection() method to get the intersecting values in the list. Then, use a nested list comprehension to include the duplicates based on the original list's count on each value:
my_list = ['a', 'b', 'a', 'd', 'e', 'f']
my_list_2 = ['a', 'b', 'c']
c = [x for x in set(my_list).intersection(set(my_list_2)) for _ in range(my_list.count(x))]
print(c)
The above may be slower than just
my_list = ['a', 'b', 'a', 'd', 'e', 'f']
my_list_2 = ['a', 'b', 'c']
c = []
for e in my_list:
    if e in my_list_2:
        c.append(e)
print(c)
But when the lists are significantly larger, the code block utilizing the set.intersection() method will be significantly more efficient (faster).
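One caveat about the set.intersection() version: my_list.count(x) rescans the whole list for every common value. A possible variant, sketched here with collections.Counter, counts everything in a single pass. Like that version, it groups duplicates together instead of keeping their original positions. The name common_grouped is hypothetical.

```python
from collections import Counter


def common_grouped(items, allowed):
    """Return the common values, each repeated as often as it occurs in items."""
    counts = Counter(items)      # one pass over items
    allowed_set = set(allowed)   # O(1) average membership tests
    return [value
            for value, n in counts.items() if value in allowed_set
            for _ in range(n)]


my_list = ['a', 'b', 'a', 'd', 'e', 'f']
my_list_2 = ['a', 'b', 'c']
c = common_grouped(my_list, my_list_2)
print(sorted(c))  # ['a', 'a', 'b']
```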
Sorry for not reading the post carefully; it is no longer possible to delete this answer. However, here is an attempt at a solution.
c = lambda my_list, my_list_2: (my_list, my_list_2, list(set(my_list).intersection(set(my_list_2))))
print("(list_1,list_2,duplicate_items) ->", c(my_list, my_list_2))
Output:
(list_1,list_2,duplicate_items) -> (['a', 'b', 'a', 'd', 'e', 'f'], ['a', 'b', 'c'], ['b', 'a'])
or can be
[i for i in my_list if i in my_list_2]
output:
['a', 'b', 'a']

Python: Find underlying sequence of word/strings based on list incomplete sequences

Suppose following list of lists with strings (not necessarily letters):
[['b', 'd'], ['b', 'd', 'e', 'f'], ['a', 'b', 'd', 'f'], ['b', 'd', 'e'], ['d', 'e', 'f'], ['f'], ['d', 'f']]
Each item in the list represents categorical data with an underlying order, like letters from the alphabet. Each string has a precursor and a successor (except for the first and last one). As you can see, some of the items, such as ['a', 'b', 'd', 'f'], are nearly complete. This particular item does not contain the letter e, for example. The item ['b', 'd', 'e', 'f'] does contain the letter e (presumably in the correct order) but not the letter a. Together the items do contain the information about the underlying sequence of strings, but none of the items alone can provide this information. I should mention that the letters are just an example. Otherwise, sorting would be easy.
I would like to obtain the unique sorted items based on alignment (alignment in the sense of sequence alignment of those lists), like so:
['a', 'b', 'd', 'e', 'f']
I am sure this is a common problem which has been solved before but I have a hard time finding similar cases. This SO thread deals with a similar issue but the ground truth is known. Here I would like to find the underlying order of strings.
Unfortunately, the longest sequence is not guaranteed to start with e.g. 'a'
I had a look at difflib but I am not sure if this is the right toolbox. Any hints are appreciated.
EDIT:
I found a solution based on NetworkX
import networkx as nx
l = [['b', 'd'], ['b', 'd', 'e', 'f'], ['a', 'b', 'd', 'f'], ['b', 'd', 'e'], ['d', 'e', 'f'], ['f'], ['d', 'f']]
# get tuples (start, stop)
new_l = []
for item in l:
    for index, subitem in enumerate(item):
        if len(item) > 1:
            if index < len(item)-1:
                new_l.append((subitem, item[index+1]))
# create a graph using those tuples
uo_graph = nx.DiGraph()
for item in new_l:
    uo_graph.add_edge(*item)
[item for item in nx.topological_sort(uo_graph)]
Out[10]: ['a', 'b', 'd', 'e', 'f']
Would be interesting if there are more pythonic solutions to this kind of problem. Especially, it would be interesting to know how to apply a check if there are multiple solutions.
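On the question of checking for multiple solutions, one possible approach is a standard-library sketch of Kahn's algorithm: a topological order is unique exactly when, at every step, only one node has no remaining predecessors. The function name unique_order is hypothetical, and this is not part of the NetworkX solution above.

```python
from collections import defaultdict


def unique_order(sequences):
    """Return (order, is_unique) for the precedence pairs found in sequences."""
    successors = defaultdict(set)
    indegree = {}
    for seq in sequences:
        for item in seq:
            indegree.setdefault(item, 0)
        for first, second in zip(seq, seq[1:]):
            if second not in successors[first]:
                successors[first].add(second)
                indegree[second] += 1

    ready = [node for node, degree in indegree.items() if degree == 0]
    order, is_unique = [], True
    while ready:
        if len(ready) > 1:   # more than one candidate: the order is ambiguous
            is_unique = False
        node = ready.pop()
        order.append(node)
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(indegree):
        raise ValueError("contradictory orderings (cycle detected)")
    return order, is_unique


sequences = [['b', 'd'], ['b', 'd', 'e', 'f'], ['a', 'b', 'd', 'f'],
             ['b', 'd', 'e'], ['d', 'e', 'f'], ['f'], ['d', 'f']]
order, is_unique = unique_order(sequences)
print(order, is_unique)  # ['a', 'b', 'd', 'e', 'f'] True
```

When is_unique is False, the returned order is only one of several valid orders.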
Ok, I just couldn't stop thinking about this problem, and now I have finally found the solution I was looking for. This only works if your problem is indeed well-defined, meaning that a) there are no contradictions in your dataset, and b) there is only one order that can be derived from it, with no room for ambiguity.
Based on this assumption, I noticed that your problem looks very similar to a zebra, or Einstein puzzle. I knew about this type of riddle, but it took me a very long time to figure out what it was actually called.
At first, I thought you could tackle this type of problem by solving a system of linear equations, but soon I realized that it is not that simple. So then I searched for Python libraries that can solve these kinds of problems. To my surprise, there don't seem to be any. Eventually I came across this reddit post, where somebody asks about how to solve Einstein's riddle in Python, and somebody in the comments points out this rather recent video of a talk by Raymond Hettinger about how you can use Python to solve well-defined problems. The interesting part is when he starts to talk about SAT solvers. Here is the link to the relevant part of his slides. These kinds of solvers are exactly what you need for your problem. We are very lucky that he made all this information so accessible (here is the github repo), because otherwise, I probably would not have been able to come up with a solver adjusted to your problem.
For some reason, there is one crucial part missing in the documentation though, which are the custom convenience functions he came up with. They were pretty hard to find, at some point, I was even thinking about sending him a message somehow, and ask for where to find them. Fortunately, that wasn't necessary, because eventually, I found them here.
With all the tools in place, I only needed to modify the solver according to your problem. The most difficult part was to come up with / translate the rules that determine how the order is retrieved from your dataset. There is really only one rule though, which is the consecutive rule, and kinda goes like this:
For any consecutive item pair in your dataset, in the order, the second item can only be placed at positions after the first one. There can be other items in between them, but there don't have to be.
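To make the rule concrete, here is a small sketch that only checks a candidate order against it; it does not search for the order. The helper name respects_rule is hypothetical and is not part of the solver.

```python
def respects_rule(order, datasets):
    """True if, for every consecutive pair in every subset, the first item
    sits at an earlier position in order than the second."""
    position = {value: index for index, value in enumerate(order)}
    return all(
        position[first] < position[second]
        for subset in datasets
        for first, second in zip(subset, subset[1:])
    )


datasets = [
    ['b', 'd'], ['b', 'd', 'e', 'f'], ['a', 'b', 'd', 'f'], ['b', 'd', 'e'],
    ['d', 'e', 'f'], ['f'], ['d', 'f']
]
print(respects_rule(['a', 'b', 'd', 'e', 'f'], datasets))  # True
print(respects_rule(['a', 'd', 'b', 'e', 'f'], datasets))  # False, 'd' precedes 'b'
```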
It took quite some trial and error to get the implementation right.
Finally, here is my version of the SAT solver that solves your problem:
import numpy as np
import itertools
from pprint import pprint
from sys import intern
from sat_utils import solve_one, from_dnf, one_of
datasets = [
['b', 'd'], ['b', 'd', 'e', 'f'], ['a', 'b', 'd', 'f'], ['b', 'd', 'e'],
['d', 'e', 'f'], ['f'], ['d', 'f']
]
values = set(itertools.chain.from_iterable(datasets))
positions = np.arange(len(values))
order = np.empty(positions.size, dtype=object)
def comb(value, position):
    """Format how a value is shown at a given position"""
    return intern(f'{value} {position}')
def found_at(value, position):
    """Value known to be at a specific position"""
    return [(comb(value, position),)]
def consecutive(value1, value2):
    """
    The values are in consecutive positions: a & b | a & _ & b |
    a & _ & _ & b ...
    """
    lst = [
        (comb(value1, i), comb(value2, j+i+1)) for i in range(len(positions))
        for j in range(len(positions[i+1:]))
    ]
    return from_dnf(lst)
cnf = []
# each value gets assigned to exactly one position
for value in values:
    cnf += one_of(comb(value, position) for position in positions)
# for each consecutive value pair, add all potential successions
for subset in datasets:
    for value1, value2 in zip(subset, subset[1:]):
        cnf += consecutive(value1, value2)
solution = solve_one(cnf)
for pair in solution:
    value, position = pair.split()
    order[int(position)] = value
pprint(order) # array(['a', 'b', 'd', 'e', 'f'], dtype=object)
I am sure this can still be optimized here and there, and I know the code is a lot longer than your solution, but I think from a technical standpoint, using a SAT solver is a very nice approach. And if you believe what Raymond Hettinger says in his talk, this is also pretty fast and scales very well.
Please make sure to test this thoroughly though, because I cannot guarantee that I didn't make any mistakes.
As a side note: In order to determine the unique items of your sample dataset, I used the nice trick pointed out in the comments here.
x = [['b', 'd'], ['b', 'd', 'e', 'f'], ['a', 'b', 'd', 'f'], ['b', 'd', 'e'], ['d', 'e', 'f'], ['f'], ['d', 'f']]
for l in x:
    l.sort()
# if you want all the elements in order, they should be of the same type
q = set()
for l in x:
    q.update(l)
s = list(q)
s.sort()

Unwanted side effects when flattening a list [duplicate]

This question already has answers here:
Content of list change unexpected in Python 3
(2 answers)
Closed 2 years ago.
I was just testing some algorithms to flatten a list, so I created 3 lists inside a list and then tried to flatten it. I never touch the original list, and the variables are named differently, but when I look at the original list, it has been modified. Any idea why this is happening?
In [63]: xxx = [['R', 'L', 'D'], ['U', 'O', 'E'], ['C', 'S', 'O']]
In [64]: def flat_ind(lst):
    ...:     one = lst[0]
    ...:     for l in lst[1:]:
    ...:         one += l
    ...:     return one
    ...:
In [65]: flat = flat_ind(xxx)
In [66]: flat
Out[66]: ['R', 'L', 'D', 'U', 'O', 'E', 'C', 'S', 'O']
In [67]: xxx
Out[67]:
[['R', 'L', 'D', 'U', 'O', 'E', 'C', 'S', 'O'],
 ['U', 'O', 'E'],
 ['C', 'S', 'O']]
I understand that one is still pointing to the original lst and that is the reason it is modifying it, but still, I thought that, since this was inside a function, it would not happen. More importantly,
how do I make this not happen?
Thanks!
"I understand that one is still pointing to the original lst and that is the reason it is modifying it, but still, I though that, since this was inside a function, it would not happen,"
That doesn't make any sense. It doesn't matter where you mutate an object, it will still be mutated.
In any case, the mutation occurs because of this:
one += l
which is an in-place modification. You could use
one = one + l
instead, but that would be highly inefficient. As others have pointed out, you could just copy that first list,
one = lst[0][:]
But the idiomatic way to flatten a regularly nested list like this is simply:
flat = [x for sub in xxx for x in sub]
Or,
from itertools import chain
flat = list(chain.from_iterable(xxx))
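The difference between aliasing and copying can be seen in a short sketch. It assumes a non-empty outer list, and the name flat_copy is hypothetical; it is the question's function with only the first line changed to take a shallow copy.

```python
xxx = [['R', 'L', 'D'], ['U', 'O', 'E'], ['C', 'S', 'O']]


def flat_copy(lst):
    one = lst[0][:]       # shallow copy: a new list object with the same items
    for sub in lst[1:]:
        one += sub        # in-place extend, but only of the copy
    return one


flat = flat_copy(xxx)
print(flat)     # ['R', 'L', 'D', 'U', 'O', 'E', 'C', 'S', 'O']
print(xxx[0])   # ['R', 'L', 'D']

alias = xxx[0]  # no copy: alias and xxx[0] are the same object
alias += ['!']  # += mutates that shared object
print(xxx[0])   # ['R', 'L', 'D', '!']
```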

Creating an irregular list of lists from a single list

I'm trying to create a list of lists from a single list. I can do this when every sublist has the same number of elements, but that will not always be the case.
As said earlier, the function below works when the sublists all have the same number of elements.
I've tried using a regular expression to determine whether an element matches a pattern, using
pattern2=re.compile(r'\d\d\d\d\d\d'), because the first value of each sublist will always be 6 digits, and it will be the only one that follows that format. However, I'm not sure of the syntax for getting it to stop at the next match and start another list.
def chunks(l,n):
    for i in range(0,len(l),n):
        yield l[i:i+n]
The code above works if the list of lists will contain the same number of elements
Below is what I expect.
OldList=[111111,a,b,c,d,222222,a,b,c,333333,a,d,e,f]
DesiredList=[[111111,a,b,c,d],[222222,a,b,c],[333333,a,d,e,f]]
Many thanks indeed.
Cheers
There is likely a more efficient way to do this (with fewer loops), but here is one approach: find the indexes of the breakpoints, then slice the list from index to index, appending None to the end of the indexes list to capture the remaining items. If your 6-digit numbers are really strings, you can drop the str() call inside re.match().
import re
d = [111111,'a','b','c','d',222222,'a','b','c',333333,'a','d','e','f']
indexes = [i for i, x in enumerate(d) if re.match(r'\d{6}', str(x))]
groups = [d[s:e] for s, e in zip(indexes, indexes[1:] + [None])]
print(groups)
# [[111111, 'a', 'b', 'c', 'd'], [222222, 'a', 'b', 'c'], [333333, 'a', 'd', 'e', 'f']]
You can use a fold.
First, define a function to locate the start flag:
>>> def is_start_flag(v):
...     return len(v) == 6 and v.isdigit()
That will be useful if the flags are not exactly what you expected them to be, or to exclude some false positives, or even if you need a regex.
Then use functools.reduce:
>>> L = d = ['111111', 'a', 'b', 'c', 'd', '222222', 'a', 'b', 'c', '333333', 'a', 'd', 'e', 'f']
>>> import functools
>>> functools.reduce(lambda acc, x: acc+[[x]] if is_start_flag(x) else acc[:-1]+[acc[-1]+[x]], L, [])
[['111111', 'a', 'b', 'c', 'd'], ['222222', 'a', 'b', 'c'], ['333333', 'a', 'd', 'e', 'f']]
If the next element x is the start flag, then append a new list [x] to the accumulator. Otherwise, add the element to the current list, i.e. the last list of the accumulator.
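The same fold can also be written as a plain loop that appends in place instead of rebuilding the accumulator on every step. This is a sketch under a few assumptions: the names looks_like_flag and split_at_flags are hypothetical, values are passed through str() so integer flags work too, and re.fullmatch is used so that only values of exactly six digits count as flags.

```python
import re


def looks_like_flag(value):
    """True for values whose string form is exactly six digits."""
    return re.fullmatch(r'\d{6}', str(value)) is not None


def split_at_flags(items):
    groups = []
    for item in items:
        # start a new group at each flag (or at the very first item)
        if looks_like_flag(item) or not groups:
            groups.append([item])
        else:
            groups[-1].append(item)
    return groups


d = [111111, 'a', 'b', 'c', 'd', 222222, 'a', 'b', 'c', 333333, 'a', 'd', 'e', 'f']
print(split_at_flags(d))
# [[111111, 'a', 'b', 'c', 'd'], [222222, 'a', 'b', 'c'], [333333, 'a', 'd', 'e', 'f']]
```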

Getting specific indexed distinct values in nested lists

I have a nested list of around 1 million records like:
l = [['a', 'b', 'c', ...], ['d', 'b', 'e', ...], ['f', 'z', 'g', ...],...]
I want to get the distinct values at the second index of the inner lists, so that my resultant list looks like:
resultant = ['b', 'z', ...]
I have tried nested loops but it's not fast. Any help will be appreciated!
Since you want the unique items, you can use collections.OrderedDict.fromkeys() to keep both the order and the uniqueness (the keys are stored in a hash table), and use zip() to get the second items.
from collections import OrderedDict
list(OrderedDict.fromkeys(zip(*my_lists)[1]))
In Python 3.x, since zip() returns an iterator, you can do this:
colls = zip(*my_lists)
next(colls)
list(OrderedDict.fromkeys(next(colls)))
Or use a generator expression within dict.fromkeys():
list(OrderedDict.fromkeys(i[1] for i in my_lists))
Demo:
>>> lst = [['a', 'b', 'c'], ['d', 'b', 'e'], ['f', 'z', 'g']]
>>>
>>> list(OrderedDict().fromkeys(sub[1] for sub in lst))
['b', 'z']
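On Python 3.7 and later, a plain dict also preserves insertion order, so the same idea can be sketched without the OrderedDict import. The function name distinct_at is hypothetical, and the sketch assumes every inner list is long enough for the requested index.

```python
def distinct_at(rows, index):
    """Distinct values found at position index, in first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(row[index] for row in rows))


lst = [['a', 'b', 'c'], ['d', 'b', 'e'], ['f', 'z', 'g']]
print(distinct_at(lst, 1))  # ['b', 'z']
```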
You can unzip the list of lists, then pick the second tuple and pass it to set, like below:
This code takes 4.05311584473e-06 milliseconds on my laptop
list(set(zip(*lst)[1]))
Input :
lst = [['a', 'b', 'c'], ['d', 'b', 'e'], ['f', 'z', 'g']]
Output :
['b', 'z']
Would that work for you?
result = set([inner_list[1] for inner_list in l])
I can think of two options.
Set comprehension:
res = {x[1] for x in l}
I think numpy arrays work faster than list/set comprehensions, so converting this list to an array and then using array functions can be faster. Here:
import numpy as np
res = np.unique(np.array(l)[:, 1])
Let me explain: np.array(l) converts the list to a 2d array, then [:, 1] takes the second column (counting from 0), which consists of the second item of each sublist in the original l, and finally np.unique keeps only the unique values.
