Using bool as key function in groupby() in Python

I am studying lists in Python and wondering how to put lists inside a list. I came across this method for building multiple lists inside a list:
from itertools import groupby
values = [0, 1, 2, 3, 4, 5, 351, 0, 1, 2, 3, 4, 5, 6, 750, 0, 1, 2, 3, 4, 559]
print([[0]+list(g) for k, g in groupby(values, bool) if k])
I tried executing this on http://pythontutor.com/ to see the process step by step, but I don't understand what bool is checking as true or false before this output is printed: [[0, 1, 2, 3, 4, 5, 351], [0, 1, 2, 3, 4, 5, 6, 750], [0, 1, 2, 3, 4, 559]]
Link to the code above: Creating a list within a list in Python

The best way to understand such problems is to dissect them.
for k, g in groupby(values, bool):
    if k:
        print(*g)
# output
1 2 3 4 5 351
1 2 3 4 5 6 750
1 2 3 4 559
for k, g in groupby(values, bool):
    if not k:
        print(*g)
# output
0
0
0
Python treats bool(0) as False and bool(x) as True for any non-zero x. So groupby splits the sequence at every 0: each run of non-zero numbers between the zeros forms one True group, and the comprehension prepends [0] to each of those groups, which is how you get your result.
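You can see the keys that groupby computes by applying the key function yourself. A minimal sketch over the first eight values:
print([bool(v) for v in values[:8]])
# [False, True, True, True, True, True, True, False]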
for k, g in groupby(values, bool):
    print(k)
    print(*g)
#output
False
0
True
1 2 3 4 5 351
False
0
True
1 2 3 4 5 6 750
False
0
True
1 2 3 4 559
for k, g in groupby(values, bool):
    print(k)
    if k:
        print([0]+list(g))
# output
False
True
[0, 1, 2, 3, 4, 5, 351]
False
True
[0, 1, 2, 3, 4, 5, 6, 750]
False
True
[0, 1, 2, 3, 4, 559]
Note
I may have explained myself loosely: what I meant by "it groups what's between the False keys together" is that those consecutive values are grouped because they are all True under bool.

Related

Compare 3 columns of a 2-D List and Replace based on conditions

I have a 2-D List as follows:
[
[6 4 4 2 5 5 4 5 4 1 3 5]
[4 3 6 5 4 4 5 1 5 5 2 4]
[2 5 2 0 4 5 4 4 2 3 2 6]
[5 5 4 3 5 4 6 7 3 4 4 4]
[3 5 6 5 6 5 3 5 3 4 7 4]
[4 5 5 4 5 4 7 5 3 5 4 1]
[2 5 3 3 5 3 4 4 3 3 1 3]
[2 5 5 2 5 4 6 2 5 6 2 5]
]
Conditions: compare columns 1, 5 and 9 (in steps of 4) row-wise and process each triple in the following order:
Step 1: If one of them is zero, do nothing and move on to the next triple; otherwise go to Step 2. (For the first row the triple is (6, 5, 4): none of them is zero, so go to Step 2.)
Step 2: If they are all equal, change all of them to zero. If not, go to Step 3.
Step 3: Take the lowest of the three and subtract this minimum from each of them.
Repeat this with the next three elements (2, 6, 10), and so on until (4, 8, 12).
How can I do this efficiently in Python using pandas, numpy, or even list operations?
Any help appreciated. Thanks!
You could write a custom function and then apply that function to every row of the array.
def check_conditions(x):
    for i in range(4):  # triples are columns (i+1, i+5, i+9), i.e. (1,5,9) up to (4,8,12)
        if x[i] == 0 or x[i+4] == 0 or x[i+8] == 0:
            continue  # step 1: the triple contains a zero, do nothing
        elif x[i] == x[i+4] == x[i+8]:
            x[i] = 0  # step 2: all three equal, zero them out
            x[i+4] = 0
            x[i+8] = 0
        else:
            min_val = min(x[i], x[i+4], x[i+8])  # step 3: subtract the minimum
            x[i] -= min_val
            x[i+4] -= min_val
            x[i+8] -= min_val
    return x

new_arr = [check_conditions(x) for x in arr]
To get the following result.
print(new_arr)
[[2, 3, 1, 0, 1, 4, 1, 3, 0, 0, 0, 3],
 [0, 0, 4, 4, 0, 1, 3, 0, 1, 2, 0, 3],
 [0, 2, 0, 0, 2, 2, 2, 4, 0, 0, 0, 6],
 [2, 1, 0, 0, 2, 0, 2, 4, 0, 0, 0, 1],
 [0, 1, 3, 1, 3, 1, 0, 1, 0, 0, 4, 0],
 [1, 1, 1, 3, 2, 0, 3, 4, 0, 1, 0, 0],
 [0, 2, 2, 0, 3, 0, 3, 1, 1, 0, 0, 0],
 [0, 1, 3, 0, 3, 0, 4, 0, 3, 2, 0, 3]]
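If you want this vectorized with numpy, the same steps can be expressed as masks over the triples. A rough sketch (my own adaptation, assuming arr is the 8x12 list above; it is equivalent to the loop because each triple touches a disjoint set of columns):
import numpy as np

a = np.array(arr)                           # shape (8, 12)
t = a.reshape(len(a), 3, 4)                 # t[:, :, i] is the triple (column i+1, i+5, i+9)
has_zero = (t == 0).any(axis=1)             # step 1: triples containing a zero stay untouched
all_equal = (t == t[:, :1, :]).all(axis=1)  # step 2 candidates: all three values equal
t = np.where((all_equal & ~has_zero)[:, None, :], 0, t)                                  # step 2
t = np.where((~all_equal & ~has_zero)[:, None, :], t - t.min(axis=1, keepdims=True), t)  # step 3
result = t.reshape(len(a), 12)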

Replace consecutive identical elements at the beginning of an array with 0

I want to replace the first N identical consecutive numbers in an array with 0.
import numpy as np
x = np.array([1, 1, 1, 1, 2, 3, 1, 2, 3, 2, 2, 2, 3, 3, 3, 1, 1, 2, 2])
OUT -> np.array([0, 0, 0, 0, 2, 3, 1, 2, 3, 2, 2, 2, 3, 3, 3, 1, 1, 2, 2])
A loop works, but what would be a faster, vectorized implementation?
i = 0
first = x[0]
while i < x.size and x[i] == first:  # check the bound before indexing
    x[i] = 0
    i += 1
You can use argmax on a boolean array to get the index of the first changing value.
Then slice and replace:
n = (x!=x[0]).argmax() # 4
x[:n] = 0
output:
array([0, 0, 0, 0, 2, 3, 1, 2, 3, 2, 2, 2, 3, 3, 3, 1, 1, 2, 2])
intermediate array:
(x!=x[0])
# n=4
# [False False False False True True True True True True True True
# True True True True True True True]
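One edge case to be aware of: if every element equals x[0], then (x != x[0]).argmax() returns 0 and nothing gets replaced. A small guard (a sketch, not part of the original answer):
mask = x != x[0]
n = mask.argmax() if mask.any() else x.size  # replace the whole array if all values are equal
x[:n] = 0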
My solution is based on itertools.groupby, so start with import itertools.
This function creates groups of consecutive equal values, in contrast to, e.g.,
the pandas version of groupby, which collects all equal values from the input
within a single group.
Another important feature is that you can assign any value to N, and only the
first N values of each run of consecutive equal values will be replaced.
To test my code, I set N = 4 and defined the source array as:
x = np.array([1, 1, 1, 1, 2, 3, 1, 2, 3, 2, 2, 2, 3, 3, 3, 1, 1, 2, 2, 2, 2, 2])
Note that it contains 5 consecutive values of 2 at the end.
Then, to get the expected result, run:
rv = []
for key, grp in itertools.groupby(x):
    lst = list(grp)
    lgth = len(lst)
    if lgth >= N:
        lst[0:N] = [0] * N
    rv.extend(lst)
xNew = np.array(rv)
The result is:
[0, 0, 0, 0, 2, 3, 1, 2, 3, 2, 2, 2, 3, 3, 3, 1, 1, 0, 0, 0, 0, 2]
Note that a sequence of 4 zeroes occurs:
at the beginning (all 4 values of 1 have been replaced),
and near the end (the first 4 of the 5 values of 2 have been replaced).
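Since the question asked for a vectorized implementation, here is a numpy sketch of the same per-run logic (my own adaptation, using the N = 4 example above):
import numpy as np

starts = np.r_[0, np.flatnonzero(x[1:] != x[:-1]) + 1]  # index where each run of equal values begins
lengths = np.diff(np.r_[starts, x.size])                # length of each run
run_start = np.repeat(starts, lengths)                  # start of the run each element belongs to
run_len = np.repeat(lengths, lengths)                   # length of the run each element belongs to
idx = np.arange(x.size)
mask = (run_len >= N) & (idx - run_start < N)           # first N elements of runs of length >= N
xNew = np.where(mask, 0, x)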

How to store numbers in chunks from an array and create another array or list? [duplicate]

I have two arrays x and y:
x = [2 3 1 1 2 5 7 3 6]
y = [0 0 4 2 4 5 8 4 5 6 7 0 5 3 2 8 1 3 1 0 4 2 4 5 4 4 5 6 7 0]
I want to create a list z and store groups/chunks of numbers from y in it, where the size of each group is defined by the values of x,
so that z stores the numbers as
z = [[0,0],[4,2,4],[5],[8],[4,5],[6,7,0,5,3],[2,8,1,3,1,0,4],[2,4,5],[4,4,5,6,7,0]]
I tried this loop:
h = []
for j in x:
    h = [[a] for i in range(j) for a in y[i:i+1]]
But it only stores the result for the last value of x.
Also, I am not sure whether the title of this question is appropriate for this problem; anyone can edit it if it is confusing. Thank you so much.
You're reassigning h each time through the loop, so it ends up with just the last iteration's assignment.
You should append to it, not assign it.
h = []
start = 0
for j in x:
    h.append(y[start:start+j])
    start += j
Another way to do it would be to use (and consume as you go) an iterator, like so:
x = [2, 3, 1, 1, 2, 5, 7, 3, 6]
y = [0, 0, 4, 2, 4, 5, 8, 4, 5, 6, 7, 0, 5, 3, 2, 8, 1, 3, 1, 0, 4, 2, 4, 5, 4, 4, 5, 6, 7, 0]
yi = iter(y)
res = [[next(yi) for _ in range(i)] for i in x]
print(res) # -> [[0, 0], [4, 2, 4], [5], [8], [4, 5], [6, 7, 0, 5, 3], [2, 8, 1, 3, 1, 0, 4], [2, 4, 5], [4, 4, 5, 6, 7, 0]]
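A closely related idiom (a sketch equivalent to the comprehension above) uses itertools.islice to pull each chunk out of the iterator in one call:
from itertools import islice

yi = iter(y)
z = [list(islice(yi, j)) for j in x]
print(z)  # same chunks as res above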
Aside from the problem you are facing, and as a general rule to live by, try to give more meaningful names to your variables.

Faster way to count values by a "kind" and update value in the DataFrame with that count?

I am trying to implement a step in a script where I look up, in each row, the "kind" of a value, which is stored in the same DataFrame, and update a count per row of how many values are of each "kind". To illustrate, here is a toy example:
d = {0: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
1: [1, 1, 2, 2, 1, 1, 2, 1, 1, 2],
2: [1, 1, 2, 2, 1, 1, 1, 1, 2, 2],
3: [2, 1, 8, 3, 6, 5, 10, 3, 4, 7],
4: [0, 0, 4, 9, 0, 0, 0, 0, 10, 9],
5: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
6: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
df = pd.DataFrame(d)
df.index += 1
In df, df[0] contains a unique ID of an object and df[1] contains the "kind" (this could be, say, the color of the object). df[3] and df[4] contain adjacent objects of interest (0 is a placeholder value, and any nonzero value is the ID of an adjacent object, so here we have either 1 or 2 adjacent objects). df[5] and df[6] are for storing how many adjacent objects are of each type. Here there are just two types, which are ints, so counts for adjacent objects of type 1 go in df[5] and counts for adjacent objects of type 2 go in df[6].
I have working code that iterates over the rows and adjacent object columns, looks up the type, and then increments the appropriate column. However, this does not scale well; my actual datasets have many more rows and object types, and this operation is called repeatedly as part of a Monte-Carlo-type simulation. I'm not exactly sure what could be done here to speed it up. I've tried a plain dictionary lookup of ID:type, but that was actually slower. Here is the functional code:
def countNeighbors(contactMap):  # in case of subgraph, still need to know the neighbor's type
    for index, row in contactMap.iterrows():
        for col in range(3, 5):  # the two adjacent-object columns, 3 and 4
            cellID = row[col]
            if cellID == 0:
                pass
            else:
                cellType = int(contactMap[contactMap[0] == cellID][1])
                contactMap.at[index, 4+cellType] += 1
    return contactMap

df = countNeighbors(df)
Expected output:
output = {0: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 1: [1, 1, 2, 2, 1, 1, 2, 1, 1, 2], 2: [1, 1, 2, 2, 1, 1, 1, 1, 2, 2], 3: [2, 1, 8, 3, 6, 5, 10, 3, 4, 7], 4: [0, 0, 4, 9, 0, 0, 0, 0, 10, 9], 5: [1, 1, 1, 1, 1, 1, 0, 0, 0, 1], 6: [0, 0, 1, 1, 0, 0, 1, 1, 2, 1]}
out_df = pd.DataFrame(output)
out_df.index += 1
So, to be clear, this output means that object 1 (row 1) is of type 1 and has 1 adjacent object, object 2. We look up object 2 in df, see that it is of type 1, and so increment col 5.
Is there a faster way to accomplish the same effect? I'm open to redesigning the data structure if required, but this format is convenient.
Option 1:
type_dict = df.set_index(0)[1].to_dict()  # ID -> type lookup
for i in [3, 4]:
    s = df[i].map(type_dict)  # type of each adjacent object in this column (NaN for the 0 placeholder)
    df.loc[:, [5, 6]] += pd.get_dummies(s)[[1, 2]].values
Option 2:
df.loc[:, [5, 6]] = (pd.get_dummies(df[[3, 4]]
                       .stack().map(type_dict))
                     .sum(level=0))
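Note that Series.sum(level=0) was deprecated and later removed in newer pandas versions; there the equivalent of Option 2 is a groupby on the index level (an adaptation of the code above, using .values to assign positionally into columns 5 and 6):
df.loc[:, [5, 6]] = (pd.get_dummies(df[[3, 4]]
                       .stack().map(type_dict))
                     .groupby(level=0).sum()
                     .values)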
Output:
0 1 2 3 4 5 6
1 1 1 1 2 0 1 0
2 2 1 1 1 0 1 0
3 3 2 2 8 4 1 1
4 4 2 2 3 9 1 1
5 5 1 1 6 0 1 0
6 6 1 1 5 0 1 0
7 7 2 1 10 0 0 1
8 8 1 1 3 0 0 1
9 9 1 2 4 10 0 2
10 10 2 2 7 9 1 1

pandas equivalent to R series of multiple repeated numbers

I want to create a simple vector of many repeated values. This is easy in R:
> numbers <- c(rep(1,5), rep(2,4), rep(3,3))
> numbers
[1] 1 1 1 1 1 2 2 2 2 3 3 3
However, if I try to do this in Python using pandas and numpy, I don't quite get the same thing:
numbers = pd.Series([np.repeat(1,5), np.repeat(2,4), np.repeat(3,3)])
numbers
0 [1, 1, 1, 1, 1]
1 [2, 2, 2, 2]
2 [3, 3, 3]
dtype: object
What's the R equivalent in Python?
Just adjust how you use np.repeat
np.repeat([1, 2, 3], [5, 4, 3])
array([1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3])
Or with pd.Series
pd.Series(np.repeat([1, 2, 3], [5, 4, 3]))
0 1
1 1
2 1
3 1
4 1
5 2
6 2
7 2
8 2
9 3
10 3
11 3
dtype: int64
That said, the purest form to replicate what you've done in R is to use np.concatenate in conjunction with np.repeat. It just isn't what I'd recommend doing.
np.concatenate([np.repeat(1,5), np.repeat(2,4), np.repeat(3,3)])
array([1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3])
With the datar package, you can now use the same syntax in Python:
>>> from datar.base import c, rep
>>>
>>> numbers = c(rep(1,5), rep(2,4), rep(3,3))
>>> print(numbers)
[1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3]
I am the author of the datar package. Feel free to submit issues if you have any questions.
