Python equivalent of R "split"-function - python

In R, you could split a vector according to the factors of another vector:
> a <- 1:10
[1] 1 2 3 4 5 6 7 8 9 10
> b <- rep(1:2,5)
[1] 1 2 1 2 1 2 1 2 1 2
> split(a,b)
$`1`
[1] 1 3 5 7 9
$`2`
[1] 2 4 6 8 10
In Python terms, this groups one list according to the values of another list, with the groups ordered by factor level.
Is there anything handy like that in Python, apart from the itertools.groupby approach?

From your example, it looks like each element in b gives the 1-indexed number of the list in which the corresponding element of a will be stored. Python lacks the automatic numeric variables that R seems to have, so we'll return a tuple of lists. Suppose you can use zero-indexed lists and you only need two lists (i.e., where your R use case has 1 and 2 as the only values, in Python they'll be 0 and 1), so that you want:
>>> a = range(1, 11)
>>> b = [0,1] * 5
>>> split(a, b)
([1, 3, 5, 7, 9], [2, 4, 6, 8, 10])
Then you can use itertools.compress:
import itertools

def split(x, f):
    # elements flagged 0 go into the first list, elements flagged 1 into the second
    return list(itertools.compress(x, (not i for i in f))), list(itertools.compress(x, f))
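For readers unfamiliar with itertools.compress, here is a minimal sketch of what it does: it keeps the elements of its first argument whose matching selector is truthy. The sample data below is made up for illustration.

```python
from itertools import compress

letters = ["a", "b", "c", "d"]
selectors = [1, 0, 1, 0]   # truthy values keep the matching element

kept = list(compress(letters, selectors))
# Negating each selector picks out the complementary elements
dropped = list(compress(letters, (not s for s in selectors)))

print(kept)     # ['a', 'c']
print(dropped)  # ['b', 'd']
```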
If you need more general input (multiple numbers), something like the following will return an n-tuple:
import itertools

def split(x, f):
    count = max(f) + 1
    # one list per value 0..max(f); use xrange instead of range on Python 2
    return tuple(list(itertools.compress(x, (el == i for el in f))) for i in range(count))
>>> split([1,2,3,4,5,6,7,8,9,10], [0,1,1,0,2,3,4,0,1,2])
([1, 4, 8], [2, 3, 9], [5, 10], [6], [7])

Edit: warning, this is a groupby solution, which is not what the OP asked for, but it may be of use to someone looking for a less specific way to split the R way in Python.
Here's one way with itertools.
import itertools
# make your sample data
a = range(1,11)
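# on Python 3, zip returns an iterator and cannot be indexed, so build b with islice
b = tuple(itertools.islice(itertools.cycle((1, 2)), len(a)))
{k: tuple(v for _, v in g) for k, g in itertools.groupby(sorted(zip(b, a)), lambda x: x[0])}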
# {1: (1, 3, 5, 7, 9), 2: (2, 4, 6, 8, 10)}
This gives you a dictionary, which is analogous to the named list that you get from R's split.

As a long time R user I was wondering how to do the same thing. It's a very handy function for tabulating vectors. This is what I came up with:
a = [1,2,3,4,5,6,7,8,9,10]
b = [1,2,1,2,1,2,1,2,1,2]
from collections import defaultdict
def split(x, f):
    res = defaultdict(list)
    for v, k in zip(x, f):
        res[k].append(v)
    return res
>>> split(a, b)
defaultdict(list, {1: [1, 3, 5, 7, 9], 2: [2, 4, 6, 8, 10]})
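R's split returns the groups ordered by factor level, whereas the defaultdict above keeps keys in first-seen order. A possible variant, assuming the keys are mutually comparable (the name split_sorted is made up here), sorts the keys before returning a plain dict:

```python
from collections import defaultdict

def split_sorted(x, f):
    """Group the values of x by the matching key in f, keys in sorted order."""
    groups = defaultdict(list)
    for value, key in zip(x, f):
        groups[key].append(value)
    # Plain dicts preserve insertion order (Python 3.7+), so inserting
    # the keys in sorted order mimics R's ordering by factor level.
    return {key: groups[key] for key in sorted(groups)}

result = split_sorted([10, 20, 30, 40, 50], ["b", "a", "b", "c", "a"])
print(result)  # {'a': [20, 50], 'b': [10, 30], 'c': [40]}
```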

You could try:
a = [1,2,3,4,5,6,7,8,9,10]
b = [1,2,1,2,1,2,1,2,1,2]
split_1 = [a[k] for k in (i for i,j in enumerate(b) if j == 1)]
split_2 = [a[k] for k in (i for i,j in enumerate(b) if j == 2)]
results in:
In [22]: split_1
Out[22]: [1, 3, 5, 7, 9]
In [24]: split_2
Out[24]: [2, 4, 6, 8, 10]
To make this generalise you can simply iterate over the unique elements in b:
splits = {}
for index in set(b):
    splits[index] = [a[k] for k in (i for i,j in enumerate(b) if j == index)]

Related

How to find column numbers in increasing order

I have a pandas dataframe, with a column containing item numbers that are supposed to increase by 1, each row.
df1 = pd.DataFrame({
"item_number" : [1, 2, 3, 4, 5, 6, 8, 10],
"col_A" : ['aaa','bbb','ccc','ddd','eee','fff','hhh', 'jjj']})
df1
item_number col_A
0 1 aaa
1 2 bbb
2 3 ccc
3 4 ddd
4 5 eee
5 6 fff
6 8 hhh
7 10 jjj
As you can see, the item number increases by two between 6 and 8 and between 8 and 10. Is there a way to write a function that will return a list of the skipped numbers, i.e. ['7','9'], and otherwise return True?
s = pd.Series(range(df1['item_number'].min(), df1['item_number'].max() + 1))
s[~s.isin(df1['item_number'])].values
array([7, 9], dtype=int64)
one-liner:
set(range(df1.item_number.min(), df1.item_number.max()+1)) - set(df1.item_number) or True
You can take advantage of Python set and list operations to check whether the input list meets the condition you describe:
li = [1, 2, 3, 4, 5, 6, 8, 10]
def fun(l):
    # numbers in the full range l[0]..l[-1] that are absent from l
    a = list(set(list(range(l[0], l[-1]+1))) - set(l))
    if a == []:
        return True
    else:
        return a
print(fun(li))
Output:
[9, 7]
Also, you can use return sorted(a) if you want the list elements to be returned in order.
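Putting the pieces together, a sketch of the set-difference approach that returns the gaps in sorted order might look like this. The function name missing_numbers is hypothetical, and it assumes a non-empty list of integers; unlike fun above it uses min() and max(), so the input need not be sorted.

```python
def missing_numbers(numbers):
    """Return the sorted integers skipped between min and max, or True if none."""
    expected = set(range(min(numbers), max(numbers) + 1))
    gaps = sorted(expected - set(numbers))
    return gaps if gaps else True

print(missing_numbers([1, 2, 3, 4, 5, 6, 8, 10]))  # [7, 9]
print(missing_numbers([1, 2, 3]))                  # True
```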
Use range with np.setdiff1d:
In [1518]: import numpy as np
In [1519]: rng = range(df1.item_number.min(), df1.item_number.max() + 1)
In [1523]: res = np.setdiff1d(rng, df1.item_number)
In [1524]: res
Out[1524]: array([7, 9])
This will do it:
def foo(df):
    x = df.set_index('item_number').reindex(range(df.item_number.min(), df.item_number.max() + 1))
    x = list(x.index[x.col_A.isna()])
    return x if x else True
Examples:
y = foo(df1)
print(y)
y = foo(df1.loc[range(1, 6)])
print(y)
Output:
[7, 9]
True

Python adding a list to a slice of another list

Here's the basic problem:
>>> listb = [ 1, 2, 3, 4, 5, 6, 7 ]
>>> slicea = slice(2,5)
>>> listb[slicea]
[3, 4, 5]
>>> lista = listb[slicea]
>>> lista
[3, 4, 5]
>>> listb[slicea] += lista
>>> listb
[1, 2, 3, 4, 5, 3, 4, 5, 6, 7]
listb should be
[1, 2, 6, 8, 10, 6, 7]
But 3, 4, 5 was inserted after 3, 4, 5 instead of being added to it element by element.
tl;dr
I have this code that's not working:
self.lib_tree.item(song)['values'][select_values] = adj_list
self.lib_tree.item(album)['values'][select_values] += adj_list
self.lib_tree.item(artist)['values'][select_values] += adj_list
The full code is this:
def toggle_select(self, song, album, artist):
    # 'values' 0=Access, 1=Size, 2=Selected Size, 3=StatTime, 4=StatSize,
    # 5=Count, 6=Seconds, 7=SelSize, 8=SelCount, 9=SelSeconds
    # Set slice to StatSize, Count, Seconds
    total_values = slice(4, 7)  # start at index, stop before index
    select_values = slice(7, 10)  # start at index, stop before index
    tags = self.lib_tree.item(song)['tags']
    if "songsel" in tags:
        # We will toggle off and subtract from selected parent totals
        tags.remove("songsel")
        self.lib_tree.item(song, tags=(tags))
        # Get StatSize, Count and Seconds
        adj_list = [element * -1 for element in \
                    self.lib_tree.item(song)['values'][total_values]]
    else:
        tags.append("songsel")
        self.lib_tree.item(song, tags=(tags))
        # Get StatSize, Count and Seconds
        adj_list = self.lib_tree.item(song)['values'][total_values]  # 1 past
    self.lib_tree.item(song)['values'][select_values] = adj_list
    self.lib_tree.item(album)['values'][select_values] += adj_list
    self.lib_tree.item(artist)['values'][select_values] += adj_list
    if self.debug_toggle < 10:
        self.debug_toggle += 1
        print('artist,album,song:',self.lib_tree.item(artist, 'text'), \
              self.lib_tree.item(album, 'text'), \
              self.lib_tree.item(song, 'text'))
        print('adj_list:',adj_list)
The adj_list has the correct values showing up in debug.
How do I add a list of values to the slice of a list?
The behavior you want is not a feature of any Python built-in type; + with built-in sequences means concatenation, not element-wise addition. But numpy arrays will do what you want, so I'd suggest looking into numpy. Simple example:
>>> import numpy as np
>>> a = np.array([2,3,4], dtype=np.int64)
>>> b = np.array([5,6,7], dtype=np.int64)
>>> a += b
>>> a
array([ 7, 9, 11])
>>> print(a)
[ 7 9 11]
>>> print(a.tolist())
[7, 9, 11]
Note that the output (both repr and str forms) looks a little different from Python lists, but you can convert back to a plain Python list if needed.
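If pulling in numpy is not an option, the same element-wise update can be sketched in plain Python by rebuilding the slice with zip and assigning it back. This assumes both sequences have the same length as the slice:

```python
listb = [1, 2, 3, 4, 5, 6, 7]
slicea = slice(2, 5)
lista = listb[slicea]          # [3, 4, 5]

# Slice assignment replaces the elements in place, so the list keeps its length
listb[slicea] = [x + y for x, y in zip(listb[slicea], lista)]

print(listb)  # [1, 2, 6, 8, 10, 6, 7]
```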

Various list concatenation method and their performance

I am working on an algorithm, and we are trying to write every line of the code so that it contributes to good overall performance.
In one situation we have to join lists (specifically, more than two). I know some of the ways to join more than two lists, and I have also looked on StackOverflow, but none of the answers give an account of the performance of each method.
Can anyone show the ways we can join more than two lists, along with their respective performance?
Edit: The size of the list varies from 2 to 13 (to be specific).
Edit (duplicate): I am specifically asking for the ways we can join lists and their respective performance; the duplicate question is limited to only 4 methods.
There are multiple ways to join more than two lists.
Assuming that we have three lists,
a = ['1']
b = ['2']
c = ['3']
Then, for joining two or more lists in Python:
1) You can simply concatenate them:
output = a + b + c
2) You can use a list comprehension:
res_list = [y for x in [a,b,c] for y in x]
3) You can use extend():
a.extend(b)
a.extend(c)
print(a)
4) You can use the * operator:
res = [*a,*b,*c]
To measure performance, I used the timeit module that ships with Python.
Ordered by time taken, the methods rank as follows:
4th method < 1st method < 3rd method < 2nd method
That means that if you use the * operator to concatenate more than two lists, you will get the best performance.
Hope you got what you were looking for.
Edit: Image showing the performance of all the methods (calculated using timeit)
I did some simple measurements, here are my results:
import timeit
from itertools import chain
a = [*range(1, 10)]
b = [*range(1, 10)]
c = [*range(1, 10)]
tests = ("""output = list(chain(a, b, c))""",
"""output = a + b + c""",
"""output = [*chain(a, b, c)]""",
"""output = a.copy();output.extend(b);output.extend(c);""",
"""output = [*a, *b, *c]""",
"""output = a.copy();output+=b;output+=c;""",
"""output = a.copy();output+=[*b, *c]""",
"""output = a.copy();output += b + c""")
results = sorted((timeit.timeit(stmt=test, number=1, globals=globals()), test) for test in tests)
for i, (t, stmt) in enumerate(results, 1):
    print(f'{i}.\t{t}\t{stmt}')
Prints on my machine (AMD 2400G, Python 3.6.7):
1. 6.010000106471125e-07 output = [*a, *b, *c]
2. 7.109999842214165e-07 output = a.copy();output += b + c
3. 7.720000212430023e-07 output = a.copy();output+=b;output+=c;
4. 7.820001428626711e-07 output = a + b + c
5. 1.0520000159885967e-06 output = a.copy();output+=[*b, *c]
6. 1.4030001693754457e-06 output = a.copy();output.extend(b);output.extend(c);
7. 1.4820000160398195e-06 output = [*chain(a, b, c)]
8. 2.525000127207022e-06 output = list(chain(a, b, c))
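Since number=1 times a single execution, the figures above are sensitive to noise. A sketch of a steadier measurement, assuming the same three small lists, is to run each statement many times with timeit.repeat and keep the best run; the repeat and number values below are arbitrary choices:

```python
import timeit

a = [*range(1, 10)]
b = [*range(1, 10)]
c = [*range(1, 10)]

tests = ("output = a + b + c",
         "output = [*a, *b, *c]",
         "output = a.copy(); output.extend(b); output.extend(c)")

results = []
for stmt in tests:
    # Each of the 5 repeats runs the statement 10,000 times; the minimum
    # is the run least disturbed by other activity on the machine.
    runs = timeit.repeat(stmt=stmt, repeat=5, number=10_000,
                         globals={"a": a, "b": b, "c": c})
    results.append((min(runs), stmt))

for best, stmt in sorted(results):
    print(f"{best:.6f}\t{stmt}")
```

The absolute numbers will differ between machines and Python versions, so only the relative ordering on your own setup is meaningful.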
If you are going to concatenate a variable number of lists together, your input is going to be a list of lists (or some equivalent collection). The performance tests need to take this into account because you are not going to be able to do things like list1+list2+list3.
Here are some test results (1000 repetitions):
option1 += loop           0.00097 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4]
option2 itertools.chain   0.00138 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4]
option3 functools.reduce  0.00174 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4]
option4 comprehension     0.00188 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4]
option5 extend loop       0.00127 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4]
option6 deque             0.00180 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4]
This would indicate that a += loop through the list of lists is the fastest approach.
And the source to produce them:
allLists = [ list(range(10)) for _ in range(5) ]
def option1():
    result = allLists[0].copy()
    for lst in allLists[1:]:
        result += lst
    return result
from itertools import chain
def option2(): return list(chain(*allLists))
from functools import reduce
def option3():
    return list(reduce(lambda a,b:a+b,allLists))
def option4(): return [ e for l in allLists for e in l ]
def option5():
    result = allLists[0].copy()
    for lst in allLists[1:]:
        result.extend(lst)
    return result
from collections import deque
def option6():
    result = deque()
    for lst in allLists:
        result.extend(lst)
    return list(result)
from timeit import timeit
count = 1000
t = timeit(lambda:option1(), number = count)
print(f"option1 += loop {t:.5f}",option1()[:15])
t = timeit(lambda:option2(), number = count)
print(f"option2 itertools.chain {t:.5f}",option2()[:15])
t = timeit(lambda:option3(), number = count)
print(f"option3 functools.reduce {t:.5f}",option3()[:15])
t = timeit(lambda:option4(), number = count)
print(f"option4 comprehension {t:.5f}",option4()[:15])
t = timeit(lambda:option5(), number = count)
print(f"option5 extend loop {t:.5f}",option5()[:15])
t = timeit(lambda:option6(), number = count)
print(f"option6 deque {t:.5f}",option6()[:15])

Convert list into table - python

I have two arrays. column_names holds the column titles. values holds all the values.
I understand if I do this:
column_names = ["a", "b", "c"]
values = [1, 2, 3]
for n, v in zip(column_names, values):
print("{} = {}".format(n, v))
I get
a = 1
b = 2
c = 3
How do I code it so if I pass:
column_names = ["a", "b", "c"]
values = [1, 2, 3, 4, 5, 6, 7, 8, 9]
I would get
a = 1, 4, 7
b = 2, 5, 8
c = 3, 6, 9
Thank you!
With pandas and numpy it is easy, and the result will be a much more useful table. Pandas excels at arranging tabular data, so let's take advantage of it.
Install pandas with:
pip install pandas --user
# numpy is installed as a dependency of pandas
import numpy as np
import pandas as pd
# this makes a normal Python list of the integers 1-9
data = list(range(1, 10))
# convert that to a numpy array
num = np.array(data)
# its shape is currently one-dimensional; reshape it into a two-dimensional 3x3 matrix to get the row breaks you want
reshaped = num.reshape(3, 3)
# now construct the table
pd.DataFrame(reshaped, columns=['a', 'b', 'c'])
# output is
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9
You can do it as follows
>>> for n, v in zip(column_names, zip(*[values[i:i+3] for i in range(0,len(values),3)])):
...     print("{} = {}".format(n, ', '.join(map(str, v))))
...
a = 1, 4, 7
b = 2, 5, 8
c = 3, 6, 9
Alternatively, you can use the grouper recipe from the itertools documentation
>>> from itertools import zip_longest
>>> def grouper(iterable, n, fillvalue=None):
...     "Collect data into fixed-length chunks or blocks"
...     # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
...     args = [iter(iterable)] * n
...     return zip_longest(*args, fillvalue=fillvalue)
...
>>> for n, v in zip(column_names, zip(*grouper(values, 3))):
...     print("{} = {}".format(n, ', '.join(map(str, v))))
...
a = 1, 4, 7
b = 2, 5, 8
c = 3, 6, 9
itertools.cycle seems appropriate in this case. Here's another version for future readers:
import itertools
column_names = ["a", "b", "c"]
values = [1, 2, 3, 4, 5, 6, 7, 8, 9]
L = zip(itertools.cycle(column_names), values)
for g, v in itertools.groupby(sorted(L), lambda x: x[0]):
    print("{} = {}".format(g, [i[1] for i in v]))
gives:
a = [1, 4, 7]
b = [2, 5, 8]
c = [3, 6, 9]
This has two sub-steps that you want to do.
First, you want to divide your list into chunks, and then you want to assign those chunks to a dictionary.
To split the list into chunks, we can create a function:
def chunk(values, chunk_size):
    # Our chunk size has to evenly fit in our list
    assert len(values) % chunk_size == 0
    chunky_list = []
    # one sub-list per column: start at offset i and step by chunk_size
    for i in range(chunk_size):
        position = i
        sub_list = []
        while position < len(values):
            sub_list.append(values[position])
            position += chunk_size
        chunky_list.append(sub_list)
    return chunky_list
At this point we will have:
[[1,4,7],[2,5,8],[3,6,9]]
From here, creating the dict is really easy. First, we zip the two lists together:
zip(column_names, chunk(values, 3))
And take advantage of the fact that Python knows how to convert a sequence of pairs into a dictionary:
dict(zip(column_names, chunk(values, 3)))
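The same column-wise chunks can also be sketched with extended slicing, where values[i::n] takes every n-th item starting at index i. This assumes len(values) is a multiple of the number of columns:

```python
column_names = ["a", "b", "c"]
values = [1, 2, 3, 4, 5, 6, 7, 8, 9]

n = len(column_names)
# values[i::n] picks items i, i+n, i+2n, ... i.e. one column of the table
table = {name: values[i::n] for i, name in enumerate(column_names)}

for name, column in table.items():
    print("{} = {}".format(name, ", ".join(map(str, column))))
# a = 1, 4, 7
# b = 2, 5, 8
# c = 3, 6, 9
```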
You can also use slicing and a collections.defaultdict to collect your values:
from collections import defaultdict
column_names = ["a", "b", "c"]
values = [1, 2, 3, 4, 5, 6, 7, 8, 9]
column_len = len(column_names)
d = defaultdict(list)
for i in range(0, len(values), column_len):
    seq = values[i:i+column_len]
    for idx, number in enumerate(seq):
        d[column_names[idx]].append(number)
for k, v in d.items():
    print('%s = %s' % (k, ', '.join(map(str, v))))
Which outputs:
a = 1, 4, 7
b = 2, 5, 8
c = 3, 6, 9
This can be improved if you create the zipped pairs with itertools.cycle, avoiding the slicing altogether:
from collections import defaultdict
from itertools import cycle
column_names = ["a", "b", "c"]
values = [1, 2, 3, 4, 5, 6, 7, 8, 9]
column_names = cycle(column_names)
d = defaultdict(list)
for column, val in zip(column_names, values):
    d[column].append(val)
for k, v in d.items():
    print('%s = %s' % (k, ', '.join(map(str, v))))

Sum integer list when next integer is the same value

I need code that checks each integer and whether the integer after it has the same value. If so, it should add the value to x.
input1 = [int(i) for i in str(1234441122)]
x= 0
My code currently gives the result [1, 2, 3, 4, 4, 4, 1, 1, 2, 2]. I want it to give the result x = 0+4+4+1+2.
I do not know of any way to do that.
The following will work. Zip together adjacent pairs and only take the first elements if they are the same as the second ones:
>>> lst = [1, 2, 3, 4, 4, 4, 1, 1, 2, 2]
>>> sum(x for x, y in zip(lst, lst[1:]) if x == y)
11
While this should be a little less [space-]efficient in theory (as the slice creates an extra list), it still has O(N) complexity in time and space and is much more readable than most solutions based on indexed access. A tricky way to avoid the slice while still being concise and avoiding any imports would be:
>>> sum((lst[i] == lst[i-1]) * lst[i] for i in range(1, len(lst))) # Py2: xrange
11
This makes use of the fact that lst[i]==lst[i-1] will be cast to 0 or 1 appropriately.
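To see why the multiplication works: bool is a subclass of int in Python, so True and False behave as 1 and 0 in arithmetic. A small illustration:

```python
lst = [1, 2, 3, 4, 4, 4, 1, 1, 2, 2]

# Comparisons return bool, which is a subclass of int (True == 1, False == 0)
assert isinstance(True, int)

# Each term is lst[i] when it equals its predecessor, and 0 otherwise
terms = [(lst[i] == lst[i - 1]) * lst[i] for i in range(1, len(lst))]

print(terms)       # [0, 0, 0, 4, 4, 0, 1, 0, 2]
print(sum(terms))  # 11
```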
Another way using itertools.groupby
l = [1, 2, 3, 4, 4, 4, 1, 1 ,2 ,2]
from itertools import groupby
sum(sum(g)-k for k,g in groupby(l))
#11
You can try this:
s = str(1234441122)
new_data = [int(a) for i, a in enumerate(s) if i+1 < len(s) and a == s[i+1]]
print(new_data)
final_data = sum(new_data)
Output:
[4, 4, 1, 2]
11
No need for that list. You can remove the "non-repeated" digits from the string already:
>>> n = 1234441122
>>> import re
>>> sum(map(int, re.sub(r'(.)(?!\1)', '', str(n))))
11
You are simply iterating over the string and converting each character to an integer. You need to iterate and compare each character to the next one.
a = str(1234441122)
total = 0  # named total to avoid shadowing the built-in sum()
for i in range(len(a) - 1):
    if a[i] == a[i+1]:
        total += int(a[i])
print(total)
Output
11
Try this one too:
input1 = [int(i) for i in str(1234441122)]
x= 0
res = [input1[i] for i in range(len(input1)-1) if input1[i+1]==input1[i]]
print(res)
print(sum(res))
Output:
[4, 4, 1, 2]
11
Here's a slightly more space-efficient version of @schwobaseggl's answer.
>>> lst = [1, 2, 3, 4, 4, 4, 1, 1, 2, 2]
>>> it = iter(lst)
>>> next(it) # throw away first value
1
>>> sum(x for x,y in zip(lst, it) if x == y)
11
Alternatively, using islice from the itertools module is equivalent but looks a bit nicer.
>>> from itertools import islice
>>> sum(x for x,y in zip(lst, islice(lst, 1, None, 1)) if x == y)
11
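On Python 3.10 and later, itertools.pairwise yields the same adjacent pairs without slicing or a manual iterator. The fallback below for older versions is a sketch based on the tee recipe:

```python
try:
    from itertools import pairwise  # available from Python 3.10
except ImportError:
    from itertools import tee

    def pairwise(iterable):
        """Fallback: yield (s0, s1), (s1, s2), ... for older Pythons."""
        first, second = tee(iterable)
        next(second, None)  # advance the second copy by one item
        return zip(first, second)

lst = [1, 2, 3, 4, 4, 4, 1, 1, 2, 2]
total = sum(x for x, y in pairwise(lst) if x == y)
print(total)  # 11
```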
