I have a dataset that looks like [3,4,5,-5,4,5,6,3,2-6,6]
I want to create a second dataset that has 0 at every index belonging to the first run of positive numbers in the first dataset, and 1 at every remaining index.
So for a = [3,4,5,-5,4,5,6,3,2-6,6] it should be
b = [0,0,0, 1,1,1,1,1,1,1]
How can I produce b from a using pandas and Python?
Since you tagged pandas, here is a solution using a Series:
import pandas as pd
s = pd.Series([3, 4, 5, -5, 4, 5, 6, 3, 2 - 6, 6])
# find the index of the first value that is not greater than zero
idx = (s > 0).idxmin()
# set every position before that index to 0 and every position from it onward to 1
res = pd.Series(s.index >= idx, dtype=int)
print(res)
Output
0 0
1 0
2 0
3 1
4 1
5 1
6 1
7 1
8 1
9 1
dtype: int64
If you prefer a one-liner:
res = pd.Series(s.index >= (s > 0).idxmin(), dtype=int)
You can use a cummax on the boolean series:
s = pd.Series([3, 4, 5, -5, 4, 5, 6, 3, 2 - 6, 6])
out = s.lt(0).cummax().astype(int)
Output:
0 0
1 0
2 0
3 1
4 1
5 1
6 1
7 1
8 1
9 1
dtype: int64
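The two pandas approaches above agree on the example, but they may differ when the series contains no negative value at all. The sketch below wraps each one in a helper (the function names are made up for illustration, and a default RangeIndex is assumed) and compares them on an all-positive input. In that case idxmin falls back to the first label, so every position is flagged, while cummax flags nothing.

```python
import pandas as pd

def flags_via_idxmin(s):
    # hypothetical helper wrapping the idxmin approach
    idx = (s > 0).idxmin()
    return pd.Series(s.index >= idx, dtype=int).tolist()

def flags_via_cummax(s):
    # hypothetical helper wrapping the cummax approach
    return s.lt(0).cummax().astype(int).tolist()

mixed = pd.Series([3, 4, 5, -5, 4, 5])
all_positive = pd.Series([3, 4, 5])

print(flags_via_idxmin(mixed))         # [0, 0, 0, 1, 1, 1]
print(flags_via_cummax(mixed))         # [0, 0, 0, 1, 1, 1]
print(flags_via_idxmin(all_positive))  # [1, 1, 1]
print(flags_via_cummax(all_positive))  # [0, 0, 0]
```

Which result is wanted for an all-positive input depends on the use case; the question does not say.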
If you are really working with lists, then pandas is not needed and numpy should be more efficient:
import numpy as np
a = [3,4,5,-5,4,5,6,3,2-6,6]
b = np.maximum.accumulate(np.array(a)<0).astype(int).tolist()
Output: [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
And if the list is small, pure python should be preferred:
from itertools import accumulate
b = list(accumulate((int(x<0) for x in a), max))
Output: [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
I have a list that looks like:
mot = [0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,1,1,1,0,0,0]
I need to append to a list, the index when the element changes from 0 to 1 (and not from 1 to 0).
I've tried to do the following, but it also registers when it changes from 1 to 0.
i = 0
while i != len(mot)-1:
    if mot[i] != mot[i+1]:
        mot_daily_index.append(i)
    i += 1
Also, but not as important, is there a cleaner implementation?
Here is how you can do that with a list comprehension:
mot = [0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,1,1,1,0,0,0]
mot_daily_index = [i for i,m in enumerate(mot) if i and m and not mot[i-1]]
print(mot_daily_index)
Output:
[7, 24]
Explanation:
list(enumerate([7,5,9,3])) will return [(0, 7), (1, 5), (2, 9), (3, 3)], so the i in i for i, m in enumerate, is the index of m during that iteration.
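The `i and` part of the condition matters because at i = 0 the expression mot[i-1] is mot[-1], which is the last element of the list. A small sketch with a made-up list that starts with 1 and ends with 0 shows what happens without that guard:

```python
mot = [1, 1, 0, 0, 1, 0]

# with the guard, index 0 is never compared against the last element
with_guard = [i for i, m in enumerate(mot) if i and m and not mot[i - 1]]
# without the guard, mot[-1] == 0 makes index 0 look like a 0 -> 1 change
without_guard = [i for i, m in enumerate(mot) if m and not mot[i - 1]]

print(with_guard)     # [4]
print(without_guard)  # [0, 4]
```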
Use a list comprehension with a filter to get your indexes:
mot = [0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,1,1,1,0,0,0]
idx = [i for i,v in enumerate(mot) if i and v > mot[i-1]]
print(idx)
Output:
[7, 24]
You could use
lst = [0, 0, 0, 1, 1, 1, 0, 1]
#      0  1  2  3  4  5  6  7
for index, (x, y) in enumerate(zip(lst, lst[1:])):
    if x == 0 and y == 1:
        print("Changed from 0 to 1 at", index)
Which yields
Changed from 0 to 1 at 2
Changed from 0 to 1 at 6
Here's a solution using itertools.groupby to group the list into 0's and 1's:
from itertools import groupby
mot = [0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,1,1,1,0,0,0]
mot_daily_index = []
l = 0
for s, g in groupby(mot):
    if s == 1:
        mot_daily_index.append(l)
    l += len(list(g))
print(mot_daily_index)
Output:
[7, 24]
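mot = [0,0,0,0,1,0,1,0,1,1,1,0,1,1,1,0,0,0,0]
mot_daily_index = [] # the required list
for i in range(len(mot)-1):
    if mot[i]==0 and mot[i+1]==1:
        mot_daily_index.append(i) # index of the 0 just before the change; use i+1 for the index of the 1
Your code adds the index whenever the i-th element differs from the (i+1)-th element, regardless of direction.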
For a 3M element container, this answer is roughly 67 times faster than the accepted answer (336 ms vs. 5.03 ms in the timings below).
This can be accomplished with numpy, by converting the list to a numpy.array.
The code for this answer is a modification of the code from Find index where elements change value numpy.
That question wanted all transitions v[:-1] != v[1:], not just the small to large transitions, v[:-1] < v[1:], in this question.
Create a Boolean array, by comparing the array to itself, shifted by one place.
Use np.where to return the indices for True
This finds the index before the change, because the arrays are shifted for comparison, so use +1 to get the correct value.
import numpy as np
v = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0]
# convert to array
v = np.array(v)
# create a Boolean array
map_ = v[:-1] < v[1:]
# return the indices
idx = np.where(map_)[0] + 1
print(idx)
[out]:
array([ 7, 24], dtype=int64)
%timeit
# v is 3M elements
v = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0] * 100000
# accepted answer
%timeit [i for i,m in enumerate(v) if i and m and not v[i-1]]
[out]:
336 ms ± 14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# this answer
v = np.array(v)
%timeit np.where(v[:-1] < v[1:])[0] + 1
[out]:
5.03 ms ± 85.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
A oneliner using zip:
mot = [0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,1,1,1,0,0,0]
[i+1 for i,m in enumerate(zip(mot[:-1],mot[1:])) if m[0]<m[1]]
# [7, 24]
Another list comprehension take:
mot = [0,1,1,1,1,0,0,0,1,0,0,1,1,1,0,1,1,1,0,0,0,0]
change_mot = [index+1 for index, value in enumerate(zip(mot[:-1], mot[1:], )) if value[1] - value[0] == 1]
Which yields
[1, 8, 11, 15]
This picks up the increase and records the index only if the increase = 1.
Let's say I have a machine that sends out 4 bits every second and I want to see the number of times a certain bit signature is sent over time.
I am given an input list of lists that contain a message in bits that change over time.
For my output I would like a list of dictionaries, per bit pair, containing the unique bit pair as the key and the times it appears as the value.
Edit New Example:
For example, the following data set represents that data, with the horizontal axis being bit position and the vertical axis being samples over time. In this example I have 4 total bits and 6 total samples.
a = [
[0, 0, 1, 1],
[0, 1, 1, 1],
[1, 1, 1, 1],
[1, 1, 1, 1],
[0, 0, 0, 0],
[1, 0, 1, 0]]
For this data set I am trying to count how many times a certain bit string occurs. The length of the bit string should be able to vary, but for this example let's say I am looking at 2 bits at a time.
So the first sample [0,0,1,1] would be split into this
[00,01,11] and the second would be [01,11,11] and the third would be [11,11,11] and so on. Producing a list like so:
y = [
[00,01,11],
[01,11,11],
[11,11,11],
[11,11,11],
[00,00,00],
[10,01,10]]
From this I want to be able to count each unique signature and to produce a dictionary with keys corresponding to the signature and values to the counts.
The dictionary would look like this
z = [
{'00':2, '01':1, '11':2, '10':1},
{'00':1, '01':2, '11':3},
{'00':1, '11':4, '10':1}]
Finding the counts is easy if I have a list of parsed items. However, getting from the raw data to that parsed list is where I am currently having some trouble. I have an implementation, but it's essentially 3 for loops and it runs really slowly over a large dataset. Surely there is a better and more Pythonic way to go about this?
I am using numpy for some additional calculation later on in my program so I would not be against using it here.
UPDATE:
I have been looking around at other things and came to this. Not sure if this is the best solution either.
import numpy as np
a = np.array([
[0, 0, 1, 1],
[0, 1, 1, 1],
[1, 1, 1, 1]])
my_list = a.astype(str).tolist()
# How many elements each
# list should have
n = 2
# using list comprehension
final = [([''.join(c[i:(i) + n]) for i in range((len(c) + n) // n)]) for c in my_list]
final = [['00', '01', '11'], ['01', '11', '11'], ['11', '11', '11']]
UPDATE 2:
I have run the following implementations and tested their speeds, and here is what I have come up with.
Running the data on the small example of 4 bits and 3 samples with a width of 2.
x = [
[0,0,1,1],
[0,1,1,1],
[1,1,1,1]]
My implementation took 0.0003 seconds
Kasrâmvd's implementation took 0.0002 seconds
Chris' implementation took 0.0002 seconds
Paul's implementation took 0.0243 seconds
However when running against an actual dataset of 64 bits and 23,497 samples with a width of 2. I got these results:
My implementation took 1.5302 seconds
Kasrâmvd's implementation took 0.3913 seconds
Chris' Implementation took 2.0802 seconds
Paul's implementation took 0.0204 seconds
Here is an approach using convolution. Fast convolution depends on FFT and therefore computes with floats; a double has a 52-bit mantissa, so 53 is the maximum pattern length we can handle.
import itertools as it
import numpy as np
import scipy.signal as ss
MAX_BITS = np.finfo(float).nmant + 1
def sliding_window(data, width, return_keys=True, return_dict=True, prune_empty=True):
    n, m = data.shape
    if width > MAX_BITS:
        raise ValueError(f"max window width is {MAX_BITS}")
    patterns = ss.convolve(data, 1<<np.arange(width)[None], 'valid', 'auto').astype(int)
    patterns += np.arange(m-width+1)*(1<<width)
    cnts = np.bincount(patterns.ravel(), None, (m-width+1)*(1<<width)).reshape(m-width+1,-1)
    if return_keys or return_dict:
        keys = np.array([*map("".join, it.product(*width*("01",)))], 'S')
        if return_dict:
            dt = np.dtype([('key', f'S{width}'), ('value', int)])
            aux = np.empty(cnts.shape, dt)
            aux['value'] = cnts
            aux['key'] = keys
            if prune_empty:
                i,j = np.where(cnts)
                return [*map(dict, np.split(aux[i,j],
                                            i.searchsorted(np.arange(1,m-width+1))))]
            return [*map(dict, aux.tolist())]
        return keys, cnts
    return cnts
example = np.random.randint(0, 2, (10,10))
print(example)
print(sliding_window(example,3))
Sample run:
[[0 1 1 1 0 1 1 1 1 1]
[0 0 1 0 1 0 0 1 0 1]
[0 0 1 0 1 1 1 0 1 1]
[1 1 1 1 1 0 0 0 1 0]
[0 0 0 0 1 1 1 0 0 0]
[1 1 0 0 0 1 0 0 1 1]
[0 1 1 1 0 1 1 1 1 1]
[0 1 0 0 0 1 1 0 0 1]
[1 0 1 1 0 1 1 0 1 0]
[0 0 1 1 0 1 0 1 0 0]]
[{b'000': 1, b'001': 3, b'010': 1, b'011': 2, b'101': 1, b'110': 1, b'111': 1}, {b'000': 1, b'010': 2, b'011': 2, b'100': 2, b'111': 3}, {b'000': 2, b'001': 1, b'101': 2, b'110': 4, b'111': 1}, {b'001': 2, b'010': 1, b'011': 2, b'101': 4, b'110': 1}, {b'010': 2, b'011': 4, b'100': 2, b'111': 2}, {b'000': 1, b'001': 1, b'100': 1, b'101': 1, b'110': 4, b'111': 2}, {b'001': 2, b'010': 2, b'100': 2, b'101': 2, b'111': 2}, {b'000': 1, b'001': 1, b'010': 2, b'011': 2, b'100': 1, b'101': 1, b'111': 2}]
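The encoding step behind the convolution approach can be looked at in isolation. The following is a minimal sketch, not the answer's code: it uses plain np.convolve on a single row from the question instead of scipy.signal.convolve on the whole array, and the variable names are chosen for illustration. Convolving with the powers of two turns each window of bits into one integer whose binary digits are that window.

```python
import numpy as np

row = np.array([0, 0, 1, 1])   # first sample from the question
width = 2

# kernel of powers of two: [1, 2] for width 2
kernel = 1 << np.arange(width)

# 'valid' keeps only fully overlapping windows; because convolution flips
# the kernel, the earliest bit of each window becomes the most significant
codes = np.convolve(row, kernel, mode="valid")

# turn each integer code back into a fixed-width bit string
labels = [format(c, f"0{width}b") for c in codes.tolist()]

print(codes.tolist())  # [0, 1, 3]
print(labels)          # ['00', '01', '11']
```

Once every window is an integer, counting occurrences reduces to a bincount over those integers, which is what makes the approach fast on large inputs.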
If you want a geometrical or algebraic analysis/solution, you can do the following:
In [108]: x = np.array([[0,0,1,1],
...: [0,1,1,1],
...: [1,1,1,1]])
...:
In [109]:
In [109]: pairs = np.dstack((x[:, :-1], x[:, 1:]))
In [110]: x, y, z = pairs.shape
In [111]: uniques = np.unique(pairs.reshape(x*y, z), axis=0)
In [112]: uniques
Out[112]:
array([[0, 0],
       [0, 1],
       [1, 1]])
# Note: 3d broadcasting is not recommended in any situation, please read doc for more details.
In [113]: R = (uniques[:,None][:,None,:] == pairs).all(3).sum(-1)
In [114]: R
Out[114]:
array([[1, 0, 0],
[1, 1, 0],
[1, 2, 3]])
Each column of the matrix R holds the counts of each unique pair in uniques for the corresponding row of your original array.
You can then get a Python object like the one you want as follows:
In [116]: [{tuple(i): j for i,j in zip(uniques, i) if j} for i in R.T]
Out[116]: [{(0, 0): 1, (0, 1): 1, (1, 1): 1}, {(0, 1): 1, (1, 1): 2}, {(1, 1): 3}]
This solution doesn't pair the bits, but gives them as tuples (although that should be simple enough to do).
EDIT: formed strings of bits as needed.
from collections import Counter
x = [[0,0,1,1],
[0,1,1,1],
[1,1,1,1]]
y = [[''.join(map(str, ref[j:j+2])) for j in range(len(x[0])-1)] \
for ref in x]
for bit in y:
    d = Counter(bit)
    print(d)
Prints
Counter({'00': 1, '01': 1, '11': 1})
Counter({'11': 2, '01': 1})
Counter({'11': 3})
EDIT: To increase the window from 2 to 3, you might add this to your code:
window = 3
offset = window - 1
y = [[''.join(map(str, ref[j:j+window])) for j in range(len(x[0])-offset)] \
for ref in x]
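The Counter loop above counts signatures per sample (per row), whereas the question asks for one dictionary per window position across all samples. A possible sketch of that variant is shown below, using the 6-sample data set from the question. The function name window_counts and its parameters are assumptions made for this illustration.

```python
from collections import Counter

def window_counts(rows, width):
    """Count bit signatures per window position across all samples."""
    n_positions = len(rows[0]) - width + 1
    counters = [Counter() for _ in range(n_positions)]
    for row in rows:
        for j in range(n_positions):
            # join the bits of the window starting at column j into a string key
            key = "".join(map(str, row[j:j + width]))
            counters[j][key] += 1
    return [dict(c) for c in counters]

a = [[0, 0, 1, 1],
     [0, 1, 1, 1],
     [1, 1, 1, 1],
     [1, 1, 1, 1],
     [0, 0, 0, 0],
     [1, 0, 1, 0]]

z = window_counts(a, 2)
for d in z:
    print(d)
```

This is still two nested Python loops, so it is meant to show the expected output shape rather than to be fast on large inputs.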
I want to get the rank of each element, so I use argsort in numpy:
np.argsort(np.array((1,1,1,2,2,3,3,3,3)))
array([0, 1, 2, 3, 4, 5, 6, 7, 8])
It gives equal elements different ranks. Can I get the same rank for equal elements, like:
array([0, 0, 0, 3, 3, 5, 5, 5, 5])
If you don't mind a dependency on scipy, you can use scipy.stats.rankdata, with method='min':
In [14]: a
Out[14]: array([1, 1, 1, 2, 2, 3, 3, 3, 3])
In [15]: from scipy.stats import rankdata
In [16]: rankdata(a, method='min')
Out[16]: array([1, 1, 1, 4, 4, 6, 6, 6, 6])
Note that rankdata starts the ranks at 1. To start at 0, subtract 1 from the result:
In [17]: rankdata(a, method='min') - 1
Out[17]: array([0, 0, 0, 3, 3, 5, 5, 5, 5])
If you don't want the scipy dependency, you can use numpy.unique to compute the ranking. Here's a function that computes the same result as rankdata(x, method='min') - 1:
import numpy as np
def rankmin(x):
    u, inv, counts = np.unique(x, return_inverse=True, return_counts=True)
    csum = np.zeros_like(counts)
    csum[1:] = counts[:-1].cumsum()
    return csum[inv]
For example,
In [137]: x = np.array([60, 10, 0, 30, 20, 40, 50])
In [138]: rankdata(x, method='min') - 1
Out[138]: array([6, 1, 0, 3, 2, 4, 5])
In [139]: rankmin(x)
Out[139]: array([6, 1, 0, 3, 2, 4, 5])
In [140]: a = np.array([1,1,1,2,2,3,3,3,3])
In [141]: rankdata(a, method='min') - 1
Out[141]: array([0, 0, 0, 3, 3, 5, 5, 5, 5])
In [142]: rankmin(a)
Out[142]: array([0, 0, 0, 3, 3, 5, 5, 5, 5])
By the way, a single call to argsort() does not give ranks. You can find an assortment of approaches to ranking in the question Rank items in an array using Python/NumPy, including how to do it using argsort().
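To make the point about argsort() concrete, here is a short sketch using the array from the example above: one call returns the indices that would sort the array, and applying argsort() to that result gives 0-based ranks. Note that this gives tied values distinct ranks, so it does not by itself solve the question.

```python
import numpy as np

x = np.array([60, 10, 0, 30, 20, 40, 50])

# indices that would sort x (not ranks)
order = x.argsort()
# position of each element in the sorted order, i.e. its rank
ranks = order.argsort()

print(order.tolist())  # [2, 1, 4, 3, 5, 6, 0]
print(ranks.tolist())  # [6, 1, 0, 3, 2, 4, 5]
```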
Alternatively, pandas series has a rank method which does what you need with the min method:
import pandas as pd
pd.Series((1,1,1,2,2,3,3,3,3)).rank(method="min")
# 0 1
# 1 1
# 2 1
# 3 4
# 4 4
# 5 6
# 6 6
# 7 6
# 8 6
# dtype: float64
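As the output shows, rank returns floats starting at 1. If the 0-based integer ranks from the question are wanted, one way (a small sketch, not part of the answer above) is to subtract 1 and cast:

```python
import pandas as pd

ranks = (
    pd.Series((1, 1, 1, 2, 2, 3, 3, 3, 3))
    .rank(method="min")   # 1-based float ranks, ties share the lowest rank
    .sub(1)               # shift to 0-based
    .astype(int)
)
print(ranks.tolist())  # [0, 0, 0, 3, 3, 5, 5, 5, 5]
```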
With focus on performance, here's an approach -
def rank_repeat_based(arr):
    idx = np.concatenate(([0],np.flatnonzero(np.diff(arr))+1,[arr.size]))
    return np.repeat(idx[:-1],np.diff(idx))
For a generic case with the elements in input array not already sorted, we would need to use argsort() to keep track of the positions. So, we would have a modified version, like so -
def rank_repeat_based_generic(arr):
    sidx = np.argsort(arr,kind='mergesort')
    idx = np.concatenate(([0],np.flatnonzero(np.diff(arr[sidx]))+1,[arr.size]))
    return np.repeat(idx[:-1],np.diff(idx))[sidx.argsort()]
Runtime test
Testing out all the approaches listed thus far to solve the problem on a large dataset.
Sorted array case :
In [96]: arr = np.sort(np.random.randint(1,100,(10000)))
In [97]: %timeit rankdata(arr, method='min') - 1
1000 loops, best of 3: 635 µs per loop
In [98]: %timeit rankmin(arr)
1000 loops, best of 3: 495 µs per loop
In [99]: %timeit (pd.Series(arr).rank(method="min")-1).values
1000 loops, best of 3: 826 µs per loop
In [100]: %timeit rank_repeat_based(arr)
10000 loops, best of 3: 200 µs per loop
Unsorted case :
In [106]: arr = np.random.randint(1,100,(10000))
In [107]: %timeit rankdata(arr, method='min') - 1
1000 loops, best of 3: 963 µs per loop
In [108]: %timeit rankmin(arr)
1000 loops, best of 3: 869 µs per loop
In [109]: %timeit (pd.Series(arr).rank(method="min")-1).values
1000 loops, best of 3: 1.17 ms per loop
In [110]: %timeit rank_repeat_based_generic(arr)
1000 loops, best of 3: 1.76 ms per loop
I've written a function for the same purpose. It uses pure python and numpy only. Please have a look. I put comments as well.
def my_argsort(array):
    # this type conversion lets us work with python lists and pandas series
    array = np.array(array)
    # create mapping for unique values
    # it's a dictionary where keys are values from the array and
    # values are their positions among the sorted unique values (dense ranks: 0, 1, 2, ...)
    unique_values = sorted(set(array))
    mapping = {value: rank for rank, value in enumerate(unique_values)}
    # apply mapping to our array
    # np.vectorize works similarly to map(), and can work with dictionaries
    array = np.vectorize(mapping.get)(array)
    return array
Hope that helps.
Complex solutions are unnecessary for this problem.
> ary = np.sort([1, 1, 1, 2, 2, 3, 3, 3, 3]) # or anything; must be sorted.
> a = (np.diff(ary) != 0).cumsum(); a
array([0, 0, 1, 1, 2, 2, 2, 2])
> b = np.r_[0, a]; b # ties get first open rank
array([0, 0, 0, 1, 1, 2, 2, 2, 2])
> c = np.flatnonzero(ary[1:] != ary[:-1])
> np.r_[0, 1 + c][b] # ties get last open rank
array([0, 0, 0, 3, 3, 5, 5, 5, 5])
I have the following pandas series (represented as a list):
[7,2,0,3,4,2,5,0,3,4]
I would like to define a new series that returns distance to the last zero. It means that I would like to have the following output:
[1,2,0,1,2,3,4,0,1,2]
How to do it in pandas in the most efficient way?
The complexity is O(n); what will slow it down is doing a for loop in Python. If there are k zeros in the series, and log k is negligible compared to the length of the series, an O(n log k) solution would be:
>>> izero = np.r_[-1, (ts == 0).nonzero()[0]] # indices of zeros
>>> idx = np.arange(len(ts))
>>> idx - izero[np.searchsorted(izero - 1, idx) - 1]
array([1, 2, 0, 1, 2, 3, 4, 0, 1, 2])
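For comparison, a linear-time vectorized variant is also possible. This is a different technique from the searchsorted lookup above: it tracks the index of the most recent zero with a running maximum. The sketch assumes the values are available as a NumPy array, and the variable names are illustrative.

```python
import numpy as np

a = np.array([7, 2, 0, 3, 4, 2, 5, 0, 3, 4])
idx = np.arange(len(a))

# position of each zero, and -1 where the value is non-zero
zero_pos = np.where(a == 0, idx, -1)
# running maximum = index of the most recent zero so far (-1 if none yet)
last_zero = np.maximum.accumulate(zero_pos)
# distance from each position to that zero
dist = idx - last_zero

print(dist.tolist())  # [1, 2, 0, 1, 2, 3, 4, 0, 1, 2]
```

Using -1 as the "no zero yet" marker plays the same role as the -1 prepended to izero above, so the leading elements count up from 1.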
A solution in Pandas is a little bit tricky, but could look like this (s is your Series):
>>> x = (s != 0).cumsum()
>>> y = x != x.shift()
>>> y.groupby((y != y.shift()).cumsum()).cumsum()
0 1
1 2
2 0
3 1
4 2
5 3
6 4
7 0
8 1
9 2
dtype: int64
For the last step, this uses the "itertools.groupby" recipe in the Pandas cookbook here.
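Since the pandas solution above is compact, it may help to look at the intermediate values. The sketch below follows the same steps, with one small deviation that is an assumption of this illustration: y is cast to int so that the shifted comparison and the grouped cumsum operate on numbers rather than booleans.

```python
import pandas as pd

s = pd.Series([7, 2, 0, 3, 4, 2, 5, 0, 3, 4])

# running count of non-zero values; it stalls at every zero
x = (s != 0).cumsum()
# 1 where the count moved (non-zero value), 0 where it stalled (zero)
y = (x != x.shift()).astype(int)
# label each run of consecutive equal values in y
block = (y != y.shift()).cumsum()
# restart the count inside every run
result = y.groupby(block).cumsum()

print(x.tolist())       # [1, 2, 2, 3, 4, 5, 6, 6, 7, 8]
print(y.tolist())       # [1, 1, 0, 1, 1, 1, 1, 0, 1, 1]
print(block.tolist())   # [1, 1, 2, 3, 3, 3, 3, 4, 5, 5]
print(result.tolist())  # [1, 2, 0, 1, 2, 3, 4, 0, 1, 2]
```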
A solution that may not be as performant (haven't really checked), but easier to understand in terms of the steps (at least for me), would be:
df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})
df
df['flag'] = np.where(df['X'] == 0, 0, 1)
df['cumsum'] = df['flag'].cumsum()
# keep the running count only at the zeros (NaN elsewhere), then carry it forward
df['offset'] = df['cumsum'].where(df['flag'] == 0)
df['offset'] = df['offset'].ffill().fillna(0).astype(int)
df['final'] = df['cumsum'] - df['offset']
df
It's sometimes surprising to see how simple it is to get c-like speeds for this stuff using Cython. Assuming your column's .values gives arr, then:
cdef int[:] arr_view = arr  # 1-D typed memoryview; arr's dtype must match C int (e.g. np.intc)
ret = np.zeros_like(arr)
cdef int[:] ret_view = ret
cdef int i, zero_count = 0
for i in range(len(ret)):
    zero_count = 0 if arr_view[i] == 0 else zero_count + 1
    ret_view[i] = zero_count
Note the use of typed memory views, which are extremely fast. You can speed it up further by decorating the function that uses this with @cython.boundscheck(False).
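The counter logic in the Cython loop can be checked in plain Python before compiling anything. This is only a reference sketch of the same loop (the function name is made up here); it will be much slower than the typed-memoryview version on large inputs.

```python
def distance_to_last_zero(values):
    """Count elements since the most recent zero, resetting to 0 at each zero."""
    out = []
    zero_count = 0
    for v in values:
        zero_count = 0 if v == 0 else zero_count + 1
        out.append(zero_count)
    return out

print(distance_to_last_zero([7, 2, 0, 3, 4, 2, 5, 0, 3, 4]))
# [1, 2, 0, 1, 2, 3, 4, 0, 1, 2]
```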
Another option
df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})
zeros = np.r_[-1, np.where(df.X == 0)[0]]
def d0(a):
    return np.min(a[a>=0])
df.index.to_series().apply(lambda i: d0(i - zeros))
Or using pure numpy
df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})
a = np.arange(len(df))[:, None] - np.r_[-1 , np.where(df.X == 0)[0]][None]
np.min(a, where=a>=0, axis=1, initial=len(df))
Yet another way to do this is with NumPy's accumulate. The only catch is that, to initialize the counter at zero, you need to insert a zero in front of the series values.
import numpy as np
# Define Python function
f = lambda a, b: 0 if b == 0 else a + 1
# Convert to Numpy ufunc
npf = np.frompyfunc(f, 2, 1)
# Apply recursively over series values
x = npf.accumulate(np.r_[0, s.values])[1:]
print(x)
array([1, 2, 0, 1, 2, 3, 4, 0, 1, 2], dtype=object)
Here is a way without using groupby:
((v:=pd.Series([7,2,0,3,4,2,5,0,3,4]).ne(0))
.cumsum()
.where(v.eq(0)).ffill().fillna(0)
.rsub(v.cumsum())
.astype(int)
.tolist())
Output:
[1, 2, 0, 1, 2, 3, 4, 0, 1, 2]
Maybe pandas is not the best tool for this, as in the answer by @behzad.nouri; however, here is another variation:
df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})
z = df.ne(0).X
z.groupby((z != z.shift()).cumsum()).cumsum()
0 1
1 2
2 0
3 1
4 2
5 3
6 4
7 0
8 1
9 2
Name: X, dtype: int64
Solution 2:
If you write the following code you will get almost everything you need, except that the first row starts from 0 and not 1:
df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})
df.eq(0).cumsum().groupby('X').cumcount()
0 0
1 1
2 0
3 1
4 2
5 3
6 4
7 0
8 1
9 2
dtype: int64
This happened because cumulative sum starts the counting from 0. To get the desired results, I added a 0 to the first row, calculated everything and then dropped the 0 at the end to get:
x = pd.Series([0], index=[0])
df = pd.concat([x, df])
df.eq(0).cumsum().groupby('X').cumcount().reset_index(drop=True).drop(0).reset_index(drop=True)
0 1
1 2
2 0
3 1
4 2
5 3
6 4
7 0
8 1
9 2
dtype: int64