I have some dataset which looks like [3,4,5,-5,4,5,6,3,2-6,6]
I want to create a second dataset that always has 0 at the indexes covered by the first sequence of positive numbers in dataset 1, and 1 at the indexes which remain.
So for a = [3,4,5,-5,4,5,6,3,2-6,6] it should be
b = [0,0,0, 1,1,1,1,1,1,1]
How can I produce b from a using pandas and Python?
Since you tagged pandas, here is a solution using a Series:
import pandas as pd
s = pd.Series([3, 4, 5, -5, 4, 5, 6, 3, 2 - 6, 6])
# find the first index whose value is not greater than zero
idx = (s > 0).idxmin()
# using that index, mark everything before it as 0 and the rest as 1
res = pd.Series(s.index >= idx, dtype=int)
print(res)
Output
0 0
1 0
2 0
3 1
4 1
5 1
6 1
7 1
8 1
9 1
dtype: int64
If you prefer a one-liner:
res = pd.Series(s.index >= (s > 0).idxmin(), dtype=int)
You can use a cummax on the boolean series:
s = pd.Series([3, 4, 5, -5, 4, 5, 6, 3, 2 - 6, 6])
out = s.lt(0).cummax().astype(int)
Output:
0 0
1 0
2 0
3 1
4 1
5 1
6 1
7 1
8 1
9 1
dtype: int64
If you are really working with lists, then pandas is not needed and numpy should be more efficient:
import numpy as np
a = [3,4,5,-5,4,5,6,3,2-6,6]
b = np.maximum.accumulate(np.array(a)<0).astype(int).tolist()
Output: [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
And if the list is small, pure python should be preferred:
from itertools import accumulate
b = list(accumulate((int(x<0) for x in a), max))
Output: [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
Let's say I have a machine that sends out 4 bits every second, and I want to see the number of times a certain bit signature is sent over time.
I am given an input list of lists that contains a message in bits that changes over time.
For my output I would like a list of dictionaries, one per pair of bit positions, containing the unique bit pair as the key and the number of times it appears as the value.
Edit (new example):
For example, the following dataset would be a representation of that data, with the horizontal axis being bit position and the vertical axis being samples over time. So for the following example I have 4 total bits and 6 total samples.
a = [
[0, 0, 1, 1],
[0, 1, 1, 1],
[1, 1, 1, 1],
[1, 1, 1, 1],
[0, 0, 0, 0],
[1, 0, 1, 0]]
For this data set I am trying to get a count of how many times a certain bit string occurs. The length should be able to vary, but for this example let's say I am doing 2 bits at a time.
So the first sample [0,0,1,1] would be split into
[00,01,11], the second would be [01,11,11], the third would be [11,11,11], and so on, producing a list like so:
y = [
[00,01,11],
[01,11,11],
[11,11,11],
[11,11,11],
[00,00,00],
[10,01,10]]
From this I want to be able to count each unique signature and produce a dictionary with keys corresponding to the signatures and values corresponding to the counts.
The dictionaries would look like this:
z = [
{'00':2, '01':1, '11':2, '10':1},
{'00':1, '01':2, '11':3},
{'00':1, '11':4, '10':1}]
Finding the counts is easy if I have a list of parsed items. However, getting from the raw data to that parsed list is where I am currently having some trouble. I have an implementation, but it's essentially 3 nested for loops and it runs really slowly over a large dataset. Surely there is a better and more pythonic way to go about this?
I am using numpy for some additional calculation later on in my program so I would not be against using it here.
UPDATE:
I have been looking around at other approaches and came up with this. Not sure if this is the best solution either.
import numpy as np
a = np.array([
[0, 0, 1, 1],
[0, 1, 1, 1],
[1, 1, 1, 1]])
my_list = a.astype(str).tolist()
# width of each window
n = 2
# using a list comprehension to take the overlapping windows of length n
final = [[''.join(c[i:i + n]) for i in range(len(c) - n + 1)] for c in my_list]
This produces final = [['00', '01', '11'], ['01', '11', '11'], ['11', '11', '11']].
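The counting step that the question describes as easy could then be done per bit position; a minimal sketch (an illustration, not the OP's actual counting code), transposing final so each group holds one position's windows:

from collections import Counter

# one dict per bit position: zip(*final) groups the windows column-wise
z = [dict(Counter(col)) for col in zip(*final)]
print(z)
# [{'00': 1, '01': 1, '11': 1}, {'01': 1, '11': 2}, {'11': 3}]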
UPDATE 2:
I ran the following implementations and tested their speeds, and here is what I came up with.
Running the data on the small example of 4 bits and 4 samples with a width of 2.
x = [
[0,0,1,1],
[0,1,1,1],
[1,1,1,1]]
My implementation took 0.0003 seconds
Kasrâmvd's implementation took 0.0002 seconds
Chris' implementation took 0.0002 seconds
Paul's implementation took 0.0243 seconds
However, when running against an actual dataset of 64 bits and 23,497 samples with a width of 2, I got these results:
My implementation took 1.5302 seconds
Kasrâmvd's implementation took 0.3913 seconds
Chris' Implementation took 2.0802 seconds
Paul's implementation took 0.0204 seconds
Here is an approach using convolution. As fast convolution relies on the FFT and therefore has to compute with floats, which carry 53 significant bits (a 52-bit mantissa plus the implicit leading bit), 53 is the maximum pattern length we can handle.
import itertools as it
import numpy as np
import scipy.signal as ss
MAX_BITS = np.finfo(float).nmant + 1
def sliding_window(data, width, return_keys=True, return_dict=True, prune_empty=True):
    n, m = data.shape
    if width > MAX_BITS:
        raise ValueError(f"max window width is {MAX_BITS}")
    # pack each width-bit window into an integer, then offset by window position
    patterns = ss.convolve(data, 1 << np.arange(width)[None], 'valid', 'auto').astype(int)
    patterns += np.arange(m - width + 1) * (1 << width)
    cnts = np.bincount(patterns.ravel(), None, (m - width + 1) * (1 << width)).reshape(m - width + 1, -1)
    if return_keys or return_dict:
        keys = np.array([*map("".join, it.product(*width * ("01",)))], 'S')
        if return_dict:
            dt = np.dtype([('key', f'S{width}'), ('value', int)])
            aux = np.empty(cnts.shape, dt)
            aux['value'] = cnts
            aux['key'] = keys
            if prune_empty:
                i, j = np.where(cnts)
                return [*map(dict, np.split(aux[i, j],
                                            i.searchsorted(np.arange(1, m - width + 1))))]
            return [*map(dict, aux.tolist())]
        return keys, cnts
    return cnts
example = np.random.randint(0, 2, (10,10))
print(example)
print(sliding_window(example,3))
Sample run:
[[0 1 1 1 0 1 1 1 1 1]
[0 0 1 0 1 0 0 1 0 1]
[0 0 1 0 1 1 1 0 1 1]
[1 1 1 1 1 0 0 0 1 0]
[0 0 0 0 1 1 1 0 0 0]
[1 1 0 0 0 1 0 0 1 1]
[0 1 1 1 0 1 1 1 1 1]
[0 1 0 0 0 1 1 0 0 1]
[1 0 1 1 0 1 1 0 1 0]
[0 0 1 1 0 1 0 1 0 0]]
[{b'000': 1, b'001': 3, b'010': 1, b'011': 2, b'101': 1, b'110': 1, b'111': 1}, {b'000': 1, b'010': 2, b'011': 2, b'100': 2, b'111': 3}, {b'000': 2, b'001': 1, b'101': 2, b'110': 4, b'111': 1}, {b'001': 2, b'010': 1, b'011': 2, b'101': 4, b'110': 1}, {b'010': 2, b'011': 4, b'100': 2, b'111': 2}, {b'000': 1, b'001': 1, b'100': 1, b'101': 1, b'110': 4, b'111': 2}, {b'001': 2, b'010': 2, b'100': 2, b'101': 2, b'111': 2}, {b'000': 1, b'001': 1, b'010': 2, b'011': 2, b'100': 1, b'101': 1, b'111': 2}]
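A quick way to sanity-check the result is to compare it against a straightforward Counter-based count over the same example array (a hedged check, not part of the original answer):

from collections import Counter

width = 3
naive = [dict(Counter("".join(map(str, row[j:j + width])) for row in example))
         for j in range(example.shape[1] - width + 1)]
# keys from sliding_window are bytes and values are NumPy ints, so normalize before comparing
conv = [{k.decode(): int(v) for k, v in d.items()} for d in sliding_window(example, width)]
print(conv == naive)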
If you want a geometrical or algebraic analysis/solution, you can do the following:
In [108]: x = np.array([[0,0,1,1],
...: [0,1,1,1],
...: [1,1,1,1]])
...:
In [109]: pairs = np.dstack((x[:, :-1], x[:, 1:]))
In [110]: x, y, z = pairs.shape
In [111]: uniques = np.unique(pairs.reshape(x*y, z), axis=0)
In [112]: uniques
Out[112]:
array([[0, 0],
       [0, 1],
       [1, 1]])
# Note: 3d broadcasting like this is generally not recommended; see the NumPy broadcasting docs for details
In [113]: R = (uniques[:,None][:,None,:] == pairs).all(3).sum(-1)
In [114]: R
Out[114]:
array([[1, 0, 0],
       [1, 1, 0],
       [1, 2, 3]])
Each column of the matrix R holds the counts of the unique pairs (the rows of uniques) within the corresponding row of your original array.
You can then get a Python object like what you want as follows:
In [116]: [{tuple(i): j for i,j in zip(uniques, i) if j} for i in R.T]
Out[116]: [{(0, 0): 1, (0, 1): 1, (1, 1): 1}, {(0, 1): 1, (1, 1): 2}, {(1, 1): 3}]
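If you prefer string keys like '00' instead of tuples, a small follow-up conversion (my addition, not part of the original answer) joins each unique pair into a string:

In [117]: [{''.join(map(str, u)): c for u, c in zip(uniques, col) if c} for col in R.T]
Out[117]: [{'00': 1, '01': 1, '11': 1}, {'01': 1, '11': 2}, {'11': 3}]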
This solution originally didn't pair the bits but gave them as tuples (although that is simple enough to do).
EDIT: it now forms strings of bits as needed.
from collections import Counter
x = [[0,0,1,1],
[0,1,1,1],
[1,1,1,1]]
y = [[''.join(map(str, ref[j:j+2])) for j in range(len(x[0])-1)] \
for ref in x]
for bit in y:
    d = Counter(bit)
    print(d)
Prints
Counter({'00': 1, '01': 1, '11': 1})
Counter({'11': 2, '01': 1})
Counter({'11': 3})
EDIT: To increase the window from 2 to 3, you might add this to your code:
window = 3
offset = window - 1
y = [[''.join(map(str, ref[j:j+window])) for j in range(len(x[0])-offset)] \
for ref in x]
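For the sample x above, window = 3 would then give (shown for illustration):

y = [['001', '011'],
     ['011', '111'],
     ['111', '111']]

and the printed counts become Counter({'001': 1, '011': 1}), Counter({'011': 1, '111': 1}) and Counter({'111': 2}).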
Given a list: [10, 4, 9, 3, 2, 5, 8, 1, 0]
that has the heap structure shown below:

            10
          /    \
         4      9
        / \    / \
       3   2  5   8
      / \
     1   0
What is a good algorithm in Python to get [4, 3, 2, 1, 0], which is basically the subtree rooted at the left child of 10?
The parent of index i is (i - 1) // 2;
the left child is 2i + 1 and the right child is 2i + 2.
L = [10, 4, 9, 3, 2, 5, 8, 1, 0]
index = 1
newheap = []
newheap.append(L[index])
leftc = 2 * index + 1
rightc = 2 * index + 2
while(leftc < len(L)):
    newheap.append(L[leftc])
    if(rightc < len(L)):
        newheap.append(L[rightc])
    leftc = 2 * leftc + 1
    rightc = 2 * rightc + 2
print(newheap)
which outputs
[4,3,2,1]
but I need [4, 3, 2, 1, 0], so that's not what I wanted. I started the index at 1, which points to 4.
Would recursion be better? Not sure how to go about this.
You can try something like this:
L = [10, 4, 9, 3, 2, 5, 8, 1, 0]
index = 0
offset = 1
newheap = []
# walk the left subtree level by level; each level is twice as wide as the previous one
while index < len(L):
    index += offset
    for i in range(offset):
        if index + i >= len(L):
            break
        newheap += [L[index + i]]
    offset = 2 * offset
print(newheap)
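The question also asks whether recursion would be better. A plain pre-order recursion visits the subtree in a different order, so to keep the level order that matches the expected [4, 3, 2, 1, 0], an explicit queue (breadth-first walk) is a natural fit; a minimal sketch (the function name is mine, not from the answer above):

from collections import deque

def subtree_levelorder(L, root):
    # breadth-first walk over the implicit tree: children of index i are 2i+1 and 2i+2
    out, queue = [], deque([root])
    while queue:
        i = queue.popleft()
        if i < len(L):
            out.append(L[i])
            queue.append(2 * i + 1)
            queue.append(2 * i + 2)
    return out

L = [10, 4, 9, 3, 2, 5, 8, 1, 0]
print(subtree_levelorder(L, 1))  # [4, 3, 2, 1, 0]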
I need to create a random array of 6 integers between 1 and 5 in Python, but I also have other data, say a = [2, 2, 3, 1, 2], which can be considered the capacities. It means that 1 can occur no more than 2 times, 3 can occur no more than 3 times, and so on.
I need to set up a counter for each integer from 1 to 5 to make sure each integer is not generated by the random function more than a[i] times.
Here is the initial array I created in Python, but I need to find out how I can enforce the condition described above. For example, I don't want a solution like [2 1 5 4 5 4], where 4 is shown twice, or [2 2 2 2 1 2].
import numpy as np
solution = np.array([np.random.randint(1, 6) for i in range(6)])
Even if I can add probability, that should work. Any help is appreciated on this.
You can create a pool of data in which each value appears as many times as its capacity, and then pick from there:
import numpy as np
a = [2, 2, 3, 1, 2]
data = [i + 1 for i, e in enumerate(a) for _ in range(e)]
print(data)
result = np.random.choice(data, 6, replace=False)
print(result)
Output
[1, 1, 2, 2, 3, 3, 3, 4, 5, 5]
[1 3 2 2 3 1]
Note that data is a list that contains each element repeated as many times as its specified count; by picking randomly from data without replacement, we ensure that you won't get more occurrences of an element than its specified count.
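A quick check of that property (my addition, not part of the answer) is to count the sampled values and compare them with the capacities in a:

from collections import Counter

counts = Counter(result)
print(all(counts[value] <= cap for value, cap in zip(range(1, 6), a)))  # should always print True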
UPDATE
If you need each number to appear at least once, you can start with a list containing each of the numbers, sample the rest from the remaining capacities, and then shuffle:
import numpy as np
result = [1, 2, 3, 4, 5]
a = [1, 1, 2, 0, 1]
data = [i + 1 for i, e in enumerate(a) for _ in range(e)]
print(data)
result = result + np.random.choice(data, 1, replace=False).tolist()
np.random.shuffle(result)
print(result)
Output
[1, 2, 3, 3, 5]
[3, 4, 2, 5, 1, 2]
Notice that I subtracted 1 from each of the original values of a; also, the original sample size of 6 was changed to 1 because you already have 5 numbers in the variable result.
You could test your counts against a dictionary of the capacities:
import random
a = [2, 2, 3, 1, 2]
d = {idx: item for idx,item in enumerate(a, start = 1)}
l = []
# retry until every number 1..5 appears at least once
while len(set(l) ^ set([*range(1, 6)])) > 0:
    l = []
    while len(l) != 6:
        x = random.randint(1, 5)
        # resample x while it has already reached its capacity in l
        while l.count(x) == d[x]:
            x = random.randint(1, 5)
        l.append(x)
print(l)
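The question also mentions that adding probabilities would be acceptable. A minimal sketch of that idea (my own illustration, not taken from the answers above): draw one value at a time with probability proportional to its remaining capacity, so an exhausted value can no longer be drawn:

import numpy as np

a = [2, 2, 3, 1, 2]
remaining = np.array(a, dtype=float)
solution = []
for _ in range(6):
    p = remaining / remaining.sum()              # weights proportional to remaining capacity
    v = int(np.random.choice(np.arange(1, 6), p=p))
    solution.append(v)
    remaining[v - 1] -= 1                        # an exhausted value gets probability 0
print(solution)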
I have the following pandas series (represented as a list):
[7,2,0,3,4,2,5,0,3,4]
I would like to define a new series that gives the distance to the last zero. That means I would like to have the following output:
[1,2,0,1,2,3,4,0,1,2]
How can I do this in pandas in the most efficient way?
The complexity is O(n). What will slow it down is doing a for loop in Python. If there are k zeros in the series, and log k is negligible compared to the length of the series, an O(n log k) solution would be:
>>> izero = np.r_[-1, (ts == 0).nonzero()[0]] # indices of zeros
>>> idx = np.arange(len(ts))
>>> idx - izero[np.searchsorted(izero - 1, idx) - 1]
array([1, 2, 0, 1, 2, 3, 4, 0, 1, 2])
A solution in Pandas is a little bit tricky, but could look like this (s is your Series):
>>> x = (s != 0).cumsum()
>>> y = x != x.shift()
>>> y.groupby((y != y.shift()).cumsum()).cumsum()
0 1
1 2
2 0
3 1
4 2
5 3
6 4
7 0
8 1
9 2
dtype: int64
For the last step, this uses the "itertools.groupby" recipe from the Pandas cookbook.
A solution that may not be as performant (haven't really checked), but easier to understand in terms of the steps (at least for me), would be:
df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})
df
df['flag'] = np.where(df['X'] == 0, 0, 1)
df['cumsum'] = df['flag'].cumsum()
df['offset'] = df['cumsum']
df.loc[df.flag==1, 'offset'] = np.nan
df['offset'] = df['offset'].fillna(method='ffill').fillna(0).astype(int)
df['final'] = df['cumsum'] - df['offset']
df
It's sometimes surprising to see how simple it is to get C-like speeds for this stuff using Cython. Assuming your column's .values gives arr, then:
cdef int[:] arr_view = arr
ret = np.zeros_like(arr)
cdef int[:] ret_view = ret
cdef int i, zero_count = 0
for i in range(len(ret)):
    zero_count = 0 if arr_view[i] == 0 else zero_count + 1
    ret_view[i] = zero_count
Note the use of typed memoryviews, which are extremely fast. You can speed this up further by decorating the enclosing function with @cython.boundscheck(False).
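For completeness, a minimal .pyx sketch that wraps the loop above in such a decorated function (the function name, the np.intc/int[:] types, and the usage line are assumptions, not from the original answer):

# distance.pyx
import numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def distance_to_last_zero(int[:] arr_view):
    # expects a 1-D array of C ints (np.intc)
    ret = np.zeros(arr_view.shape[0], dtype=np.intc)
    cdef int[:] ret_view = ret
    cdef Py_ssize_t i
    cdef int zero_count = 0
    for i in range(arr_view.shape[0]):
        zero_count = 0 if arr_view[i] == 0 else zero_count + 1
        ret_view[i] = zero_count
    return ret

From Python this would then be called as something like distance_to_last_zero(s.values.astype(np.intc)).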
Another option
df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})
zeros = np.r_[-1, np.where(df.X == 0)[0]]
def d0(a):
    return np.min(a[a >= 0])
df.index.to_series().apply(lambda i: d0(i - zeros))
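For the example frame this should reproduce the expected distances; a quick check (my addition, not in the original answer):

expected = [1, 2, 0, 1, 2, 3, 4, 0, 1, 2]
print(df.index.to_series().apply(lambda i: d0(i - zeros)).tolist() == expected)  # True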
Or using pure numpy
df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})
a = np.arange(len(df))[:, None] - np.r_[-1 , np.where(df.X == 0)[0]][None]
np.min(a, where=a>=0, axis=1, initial=len(df))
Yet another way to do this uses NumPy accumulate. The only catch is that, to initialize the counter at zero, you need to insert a zero in front of the series values.
import numpy as np
# Define Python function
f = lambda a, b: 0 if b == 0 else a + 1
# Convert to Numpy ufunc
npf = np.frompyfunc(f, 2, 1)
# Apply recursively over series values
x = npf.accumulate(np.r_[0, s.values])[1:]
print(x)
array([1, 2, 0, 1, 2, 3, 4, 0, 1, 2], dtype=object)
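Because frompyfunc builds an object-dtype ufunc, the accumulated result comes back with dtype=object, as shown above; if an integer Series is needed for further work, a small cast (an extra step, not in the original answer) takes care of it:

import pandas as pd

out = pd.Series(x.astype(int), index=s.index)  # back to a plain integer Series
print(out.tolist())  # [1, 2, 0, 1, 2, 3, 4, 0, 1, 2]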
Here is a way without using groupby:
((v:=pd.Series([7,2,0,3,4,2,5,0,3,4]).ne(0))
.cumsum()
.where(v.eq(0)).ffill().fillna(0)
.rsub(v.cumsum())
.astype(int)
.tolist())
Output:
[1, 2, 0, 1, 2, 3, 4, 0, 1, 2]
Maybe pandas is not the best tool for this, as shown in the answer by @behzad.nouri; however, here is another variation:
df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})
z = df.ne(0).X
z.groupby((z != z.shift()).cumsum()).cumsum()
0 1
1 2
2 0
3 1
4 2
5 3
6 4
7 0
8 1
9 2
Name: X, dtype: int64
Solution 2:
If you write the following code you will get almost everything you need, except that the first row starts from 0 and not 1:
df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})
df.eq(0).cumsum().groupby('X').cumcount()
0 0
1 1
2 0
3 1
4 2
5 3
6 4
7 0
8 1
9 2
dtype: int64
This happens because cumcount starts counting from 0 within each group. To get the desired result, I prepended a 0 row, calculated everything, and then dropped that extra row at the end to get:
x = pd.Series([0], index=[0])
df = pd.concat([x, df])
df.eq(0).cumsum().groupby('X').cumcount().reset_index(drop=True).drop(0).reset_index(drop=True)
0 1
1 2
2 0
3 1
4 2
5 3
6 4
7 0
8 1
9 2
dtype: int64