How to recursively combine pairs of elements from a list? - python

I am attempting to deduplicate some pandas DataFrames, and I have a function that does this pair-wise (i.e. two dfs at a time). I want to write another function that takes a list of DataFrames of arbitrary length, combines the first two elements of the list, then combines that result with the third element, and so on until the end of the list.
For simplicity, I'll assume my deduplication function is simply string concatenation.
I tried writing a recursive function, but it's not quite correct.
def dedupe_recursive(input_list):
    if input_list == []:
        return
    else:
        for i in range(0, len(input_list)-1):
            new_list = input_list[i+1:]
            deduped = dedupe(new_list[i], new_list[i+1])
            print(deduped, new_list)
            return dedupe_recursive(new_list)
Input (list): ['a', 'b', 'c', 'd']
Output (list of lists): [['ab'], ['ab', 'c'], ['abc', 'd']]

There's a function for exactly this kind of thing, it's called reduce. You would use it like this:
from functools import reduce
final_df = reduce(dedupe, list_of_dataframes)
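For example, with the question's stand-in dedupe (string concatenation), reduce folds the list pair-wise from the left, exactly as described:

```python
from functools import reduce

# Stand-in for the pair-wise dedupe function from the question:
# simple string concatenation
def dedupe(left, right):
    return left + right

final = reduce(dedupe, ['a', 'b', 'c', 'd'])
print(final)  # abcd
```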

Split list into parallel lists using Python

I would like to create a parallel list from my original list and then use sort_together
original = ['4,d', '3,b']
The parallel lists should look like this:
lis1 = ['4', '3']
list2 = ['d', 'b']
I've tried using split but was only able to obtain a single list :(
[i.split(",", 1) for i in original]
You can use the zip(*...) trick together with .split:
list1, list2 = zip(*(x.split(",") for x in original))
Now this actually gives you two tuples instead of lists but that should be easy to fix if you really need lists.
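For example, converting the tuples to lists explicitly gives exactly the expected output:

```python
original = ['4,d', '3,b']

# zip(*...) transposes the split pairs; converting each tuple to a list
list1, list2 = (list(t) for t in zip(*(x.split(",") for x in original)))
print(list1)  # ['4', '3']
print(list2)  # ['d', 'b']
```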
You can use map and zip:
lis1, list2 = zip(*map(lambda x:x.split(","), original))
map will apply the function passed as the first argument (in this case, it simply splits strings on the comma separator) to every element of the iterable (a list in this case) passed as the second argument. After this, you'll have a map object which yields ['4', 'd'] and ['3', 'b'].
The zip function takes two (or more) iterables and puts them side by side (like a physical zip would do), grouping the elements at each position together. For example, list(zip([1,2,3],[4,5,6])) is [(1, 4), (2, 5), (3, 6)].
The * unpacking is necessary because you want to pass the two sublists inside the returned map object to zip as separate arguments.
ini_list = [[4,'d'], [3,'b']]
print ("initial list", str(ini_list))
res1, res2 = map(list, zip(*ini_list))
print("final lists", res1, "\n", res2)
You can use this code to sort and get the two lists:
list1, list2 = [sorted(t) for t in zip(*(item.split(',') for item in original))]
Or, with an explicit loop:
s = [None, None]
for i in range(2):
    s[i] = sorted(oi.split(',')[i] for oi in original)
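Since the question mentions sort_together, here is a plain-Python sketch of the same idea without more_itertools: sort the split pairs as units (so the two lists stay aligned), then unzip.

```python
original = ['4,d', '3,b']

# Sort the (number, letter) pairs together so corresponding
# elements stay in the same positions in both output lists
pairs = sorted(item.split(',') for item in original)
list1, list2 = (list(t) for t in zip(*pairs))
print(list1)  # ['3', '4']
print(list2)  # ['b', 'd']
```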

Finding all the elements in a list between two elements (not using index, and with wrap around)

I'm trying to figure out a way to find all the elements that appear between two list elements (inclusive) - but to do it without reference to position, and instead with reference to the elements themselves. It's easier to explain with code:
I have a list like this:
['a','b','c','d','e']
And I want a function that would take two arguments corresponding to elements, e.g. f('a','d'), and return the following:
['a','b','c','d']
I'd also like it to wrap around, eg. f('d','b'):
['d','e','a','b']
I'm not sure how to go about coding this. One hacky way I've thought of is duplicating the list in question (['a','b','c','d','e','a','b','c','d','e']) and then looping through it and flagging when the first element appears and when the last element does and then discarding the rest - but it seems like there would be a better way. Any suggestions?
def foo(a, b):
    s, e = [a.index(x) for x in b]
    if s <= e:
        return a[s:e+1]
    else:
        return a[s:] + a[:e+1]
print(foo(['a','b','c','d','e'], ['a', 'd'])) # --> ['a', 'b', 'c', 'd']
print(foo(['a','b','c','d','e'], ['d', 'b'])) # --> ['d', 'e', 'a', 'b']
So the following obviously needs error handling as indicated below; also note that the index() function only returns the index of the first occurrence. You have not specified how you want to handle duplicate elements in the list.
def f(mylist, elem1, elem2):
    posn_first = mylist.index(elem1)   # what if it's not in the list?
    posn_second = mylist.index(elem2)  # ditto
    if posn_first <= posn_second:
        return mylist[posn_first:posn_second+1]
    else:
        return mylist[posn_first:] + mylist[:posn_second+1]
This would be a simple approach, given you always want to use the first appearance of each element in the list (note it does not wrap around):
def get_wrapped_values(input_list, start_element, end_element):
    return input_list[input_list.index(start_element): input_list.index(end_element)+1]
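A quick check of this simpler version on the question's data, covering the non-wrapping case:

```python
def get_wrapped_values(input_list, start_element, end_element):
    return input_list[input_list.index(start_element): input_list.index(end_element)+1]

print(get_wrapped_values(['a', 'b', 'c', 'd', 'e'], 'a', 'd'))  # ['a', 'b', 'c', 'd']
```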

how to get the subset of list without repetition in python?

I want to get the subset of the list without repetition of any elements in python.
For example, Mylist=[['A','B','C','D','E']]
I want to get the output like [['A','B','C'],['D','E','A']]
I see that what you need is a list of permutations of your list.
But your list should not include nested lists, so change the way
you defined it to:
Mylist=['A','B','C','D','E']
(in single brackets).
To print a full list of permutations, you can use itertools.permutations
(import itertools needed), e.g.
pLen = 3  # How many elements in each list
for prm in itertools.permutations(Mylist, pLen):
    print(list(prm))
Note that I used casting to list type, as itertools.permutations
returns tuples.
But if you want only a limited number of such permutations,
you can achieve it using another itertools function, namely islice:
pLen = 3     # How many elements in each list
listLen = 4  # How many lists
result = [list(prm) for prm in itertools.islice(
    itertools.permutations(Mylist, pLen), listLen)]
The result is:
[['A', 'B', 'C'], ['A', 'B', 'D'], ['A', 'B', 'E'], ['A', 'C', 'B']]
Another approach, without itertools:
import random
pLen = 3     # How many elements in each list
listLen = 4  # How many lists
result = []
wrk = Mylist.copy()
for i in range(listLen):
    random.shuffle(wrk)
    result.append(wrk[:pLen])

Elegant slicing in python list based on index

I was wondering what would be an efficient and elegant way of slicing a Python list based on a list of indices. To provide a minimal example:
temp = ['a','b','c','d']
index_needed=[0,2]
How can I slice the list without a loop?
expected output
output_list =['a','c']
I have a sense that there would be a way but haven't figured out any. Any suggestions?
First, note that indexing in Python begins at 0, so the indices [0, 2] refer to the first and third elements.
You can then use a list comprehension:
temp = ['a', 'b', 'c', 'd']
idx = [0, 2]
res = [temp[i] for i in idx] # ['a', 'c']
With built-ins, you may find map performs better:
res = map(temp.__getitem__, idx) # ['a', 'c']
On Python 2, this returns a list. On Python 3.x, you would need to pass the map object to list.
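For example, on Python 3:

```python
temp = ['a', 'b', 'c', 'd']
idx = [0, 2]

# map returns a lazy iterator on Python 3, so wrap it in list
res = list(map(temp.__getitem__, idx))
print(res)  # ['a', 'c']
```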
If you are looking to avoid a Python-level loop altogether, you may wish to use a 3rd party library such as NumPy:
import numpy as np
temp = np.array(['a', 'b', 'c', 'd'])
res = temp[idx]
# array(['a', 'c'],
# dtype='<U1')
res2 = np.delete(temp, idx)
# array(['b', 'd'],
# dtype='<U1')
This returns a NumPy array, which can then be converted to a list via res.tolist().
Use this:
temp = ['a','b','c','d']
temp[0:4:2]
#Output
['a', 'c']
Here the first value is the starting index (included), the second value is the ending index (excluded), and the third value is the step.
Happy Learning...:)
An alternative that pushes the work to the C layer on CPython (the reference interpreter):
from operator import itemgetter
temp = ['a','b','c','d']
index_needed=[0,2]
output_list = itemgetter(*index_needed)(temp)
That returns a tuple of the values; if a list is necessary, just wrap it in the list constructor:
output_list = list(itemgetter(*index_needed)(temp))
Note that this only works cleanly if you need at least two indices; itemgetter's return type varies based on how it's initialized, returning the value directly when it's passed a single key to pull, and a tuple of values when passed more than one key.
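The varying return type is easy to see directly:

```python
from operator import itemgetter

temp = ['a', 'b', 'c', 'd']
print(itemgetter(0)(temp))     # a          (single key: bare value)
print(itemgetter(0, 2)(temp))  # ('a', 'c') (multiple keys: a tuple)
```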
It's also not particularly efficient for one-off uses. A more common use case is when you have an iterable of sequences (typically tuples, but any sequence works) and only care about some positions in each. For example, with an input list of:
allvalues = [(1, 2, 3, 4),
(5, 6, 7, 8)]
if you only wanted the values from index 1 and 3, you could write a loop like:
for _, x, _, y in allvalues:
where you unpack all the values but send the ones you don't care about to _ to indicate the lack of interest, or you can use itemgetter and map to strip them down to what you care about before the unpack:
from future_builtins import map # Because Py2's map is terrible; not needed on Py3
for x, y in map(itemgetter(1, 3), allvalues):
The itemgetter based approach doesn't care if you have more than four items in a given element of allvalues, while manual unpacking would always require exactly four; which is better is largely based on your use case.
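Putting that together on the example data (Python 3, where map is already lazy):

```python
from operator import itemgetter

allvalues = [(1, 2, 3, 4),
             (5, 6, 7, 8)]

# Strip each row down to indices 1 and 3 before unpacking
for x, y in map(itemgetter(1, 3), allvalues):
    print(x, y)  # 2 4, then 6 8
```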

Categorize list in Python

What is the best way to categorize a list in python?
for example:
totalist is below
totalist[1] = ['A','B','C','D','E']
totalist[2] = ['A','B','X','Y','Z']
totalist[3] = ['A','F','T','U','V']
totalist[4] = ['A','F','M','N','O']
Say I want to get the lists where the first two items are ['A','B'], basically totalist[1] and totalist[2]. Is there an easy way to get these without iterating one item at a time? Something like this?
if ['A','B'] in totalist
I know that doesn't work.
You could check the first two elements of each list.
for sub in totalist:
    if sub[:2] == ['A', 'B']:
        pass  # Do something.
Note: The one-liner solutions suggested by Kasramvd are quite nice too. I found my solution more readable. Though I should say comprehensions are slightly faster than regular for loops. (Which I tested myself.)
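A complete version of that loop, collecting the matches on the question's data:

```python
totalist = [['A', 'B', 'C', 'D', 'E'],
            ['A', 'B', 'X', 'Y', 'Z'],
            ['A', 'F', 'T', 'U', 'V'],
            ['A', 'F', 'M', 'N', 'O']]

matches = []
for sub in totalist:
    if sub[:2] == ['A', 'B']:
        matches.append(sub)
print(matches)  # the first two sublists
```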
Just for fun, itertools solution to push per-element work to the C layer:
from future_builtins import map # Py2 only; not needed on Py3
from itertools import compress
from operator import itemgetter
# Generator
prefixes = map(itemgetter(slice(2)), totalist)
selectors = map(['A','B'].__eq__, prefixes)
# If you need them one at a time, just skip list wrapping and iterate
# compress output directly
matches = list(compress(totalist, selectors))
This could all be one-lined to:
matches = list(compress(totalist, map(['A','B'].__eq__, map(itemgetter(slice(2)), totalist))))
but I wouldn't recommend it. Incidentally, if totalist might be a generator, not a re-iterable sequence, you'd want to use itertools.tee to double it, adding:
totalist, forselection = itertools.tee(totalist, 2)
and changing the definition of prefixes to map over forselection, not totalist; since compress iterates both iterators in parallel, tee won't have meaningful memory overhead.
Of course, as others have noted, even moving to C, this is a linear algorithm. Ideally, you'd use something like a collections.defaultdict(list) to map from two element prefixes of each list (converted to tuple to make them legal dict keys) to a list of all lists with that prefix. Then, instead of linear search over N lists to find those with matching prefixes, you just do totaldict['A', 'B'] and you get the results with O(1) lookup (and less fixed work too; no constant slicing).
Example precompute work:
from collections import defaultdict

totaldict = defaultdict(list)
for x in totalist:
    totaldict[tuple(x[:2])].append(x)
# Optionally, to prevent autovivification later:
totaldict = dict(totaldict)
Then you can get matches effectively instantly for any two element prefix with just:
matches = totaldict['A', 'B']
You could do this.
>>> for i in totalist:
...     if ['A','B'] == i[:2]:
...         print i
Basically you can't do this in Python with a nested list, but if you are looking for an optimized approach, here are some ways:
Use a simple list comprehension, by comparing the intended list with only first two items of sub lists:
>>> [sub for sub in totalist if sub[:2] == ['A', 'B']]
[['A', 'B', 'C', 'D', 'E'], ['A', 'B', 'X', 'Y', 'Z']]
If you want the indices use enumerate:
>>> [ind for ind, sub in enumerate(totalist) if sub[:2] == ['A', 'B']]
[0, 1]
And here is an approach in NumPy, which is pretty well optimized when you are dealing with large data sets:
>>> import numpy as np
>>>
>>> totalist = np.array([['A','B','C','D','E'],
... ['A','B','X','Y','Z'],
... ['A','F','T','U','V'],
... ['A','F','M','N','O']])
>>> totalist[(totalist[:,:2]==['A', 'B']).all(axis=1)]
array([['A', 'B', 'C', 'D', 'E'],
['A', 'B', 'X', 'Y', 'Z']],
dtype='|S1')
Also, as an alternative to a list comprehension, if you don't want to use a loop and are looking for a functional way, you can use the filter function, which is not as optimized as a list comprehension:
>>> list(filter(lambda x: x[:2]==['A', 'B'], totalist))
[['A', 'B', 'C', 'D', 'E'], ['A', 'B', 'X', 'Y', 'Z']]
You imply that you are concerned about performance (cost). If you need to do this, and you are worried about performance, you need a different data structure. This will add a little cost when you make the lists, but save you time when filtering them.
If the need to filter based on the first two elements is fixed (it doesn't generalise to the first n elements), then I would add the lists, as they are made, to a dict where the key is a tuple of the first two elements and the value is a list of lists.
Then you simply retrieve your lists by doing a dict lookup. This is easy to do and will bring potentially large speed-ups, at almost no cost in memory and time while making the lists.
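A minimal sketch of that idea (names are illustrative): build the dict as the lists are created, keyed on a tuple of the first two elements, then look lists up by prefix:

```python
from collections import defaultdict

lists_by_prefix = defaultdict(list)  # illustrative name

def add_list(new_list):
    # Key on a tuple of the first two elements (tuples are hashable)
    lists_by_prefix[tuple(new_list[:2])].append(new_list)

for sub in [['A', 'B', 'C', 'D', 'E'], ['A', 'B', 'X', 'Y', 'Z'],
            ['A', 'F', 'T', 'U', 'V'], ['A', 'F', 'M', 'N', 'O']]:
    add_list(sub)

print(lists_by_prefix[('A', 'B')])  # the two ['A', 'B', ...] lists
```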
