Given that my dataframe df has 13 rows and 5 classes, as follows:
import pandas as pd
df = pd.DataFrame( [[0,0,1,0,0],
[0,0,0,0,1],
[0,0,1,0,0],
[0,0,0,0,1],
[0,0,1,0,0],
[0,0,0,0,1],
[0,0,1,0,0],
[0,0,0,1,0],
[0,1,1,0,0],
[1,0,0,1,0],
[0,1,1,0,0],
[1,0,0,1,0],
[0,1,1,0,1]])
Note that the columns of df are [0,1,2,3,4] by default.
When you run df.value_counts(), you get:
0 1 2 3 4
0 0 1 0 0 4
0 0 1 3
1 1 0 0 2
1 0 0 1 0 2
0 0 0 1 0 1
1 1 0 1 1
You can see that it returns every unique sequence of the 5 classes together with its count (the blank entries repeat the value above them, I guess). Now I am wondering: is there a way to get, as a dictionary, the index values of the rows that contain each of these unique sequences?
For this case, it would return the following dictionary, where each key is a unique sequence of classes and each value is the list of matching index values, like this:
{(0,0,1,0,0): [0,2,4,6],
(0,0,0,0,1): [1,3,5],
(0,1,1,0,0): [8,10],
(1,0,0,1,0): [9,11],
(0,0,0,1,0): [7],
(0,1,1,0,1): [12]}
Thank you in advance
You can simply use the .groupby method.
When .groupby is given a list of columns, it groups the rows by every combination of values that occurs in those columns, and the .groups attribute then gives you the dictionary you expect:
df.groupby([0,1,2,3,4]).groups
Result:
{(0, 0, 0, 0, 1): [1, 3, 5], (0, 0, 0, 1, 0): [7], (0, 0, 1, 0, 0): [0, 2, 4, 6], (0, 1, 1, 0, 0): [8, 10], (0, 1, 1, 0, 1): [12], (1, 0, 0, 1, 0): [9, 11]}
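Note that .groups maps each key to an index object rather than to a plain Python list. A minimal sketch, assuming the df from the question, that converts the values to lists (the name index_by_pattern is only illustrative):

```python
import pandas as pd

df = pd.DataFrame([[0, 0, 1, 0, 0],
                   [0, 0, 0, 0, 1],
                   [0, 0, 1, 0, 0],
                   [0, 0, 0, 0, 1],
                   [0, 0, 1, 0, 0],
                   [0, 0, 0, 0, 1],
                   [0, 0, 1, 0, 0],
                   [0, 0, 0, 1, 0],
                   [0, 1, 1, 0, 0],
                   [1, 0, 0, 1, 0],
                   [0, 1, 1, 0, 0],
                   [1, 0, 0, 1, 0],
                   [0, 1, 1, 0, 1]])

# Group on every column, then turn each group's index into a plain list.
groups = df.groupby(list(df.columns)).groups
index_by_pattern = {key: idx.tolist() for key, idx in groups.items()}

print(index_by_pattern[(0, 0, 1, 0, 0)])
```

Using list(df.columns) avoids hard-coding the column labels, which may help if the number of classes changes.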
I have a dataframe where I want to group rows based on a column. Some of the columns in the rows I want to sum up and the others I want to aggregate as a list.
# creating sample data
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
df['id'] = [1,2,1,4]
df['group'] = [[0,1,2,3] , [0,2,3,4], [1,1,1,1], 1]
df
Out[5]:
a b c d id group
0 0.850058 0.160497 0.742296 0.354296 1 [0, 1, 2, 3]
1 0.598759 0.399200 0.799157 0.908174 2 [0, 2, 3, 4]
2 0.160764 0.671702 0.414800 0.429992 1 [1, 1, 1, 1]
3 0.011089 0.581518 0.718829 0.610140 4 1
Here I want to combine row 0 and row 2 as they have the same id. When doing this, I want to sum up the values in columns a, b, c and d but for column group, I want the lists to be appended. How can I do this?
My expected output is:
a b c d id group
0 1.010822 0.832199 1.157096 0.784288 1 [0, 1, 2, 3, 1, 1, 1, 1]
1 0.598759 0.399200 0.799157 0.908174 2 [0, 2, 3, 4]
2 0.011089 0.581518 0.718829 0.610140 4 1
(When I use only sum, or df.groupby(['id'])['group'].apply(list), the other columns are dropped.)
Use groupby.aggregate
df.groupby('id').agg({k: sum for k in ['a', 'b', 'c', 'd', 'group']})
A one-liner alternative would be using numeric_only flag. But be careful with the columns you are feeding in.
df.groupby('id').sum(numeric_only=False)
Output
a b c d group
id
1 1.488778 0.802794 0.949768 0.952676 [0, 1, 2, 3, 1, 1, 1, 1]
2 0.488390 0.512301 0.064922 0.233875 [0, 2, 3, 4]
4 0.649945 0.267125 0.229313 0.156696 1
First Solution:
We can do this in 2 steps. The 1st step uses GroupBy.sum to get the grouped sum of the first 4 columns. The 2nd step acts on the group column only and concatenates the lists, also with GroupBy.sum:
df.groupby('id')[['a', 'b', 'c', 'd']].sum().join(df.groupby('id')['group'].sum()).reset_index()
Input (Different values owing to the different random numbers generated)
a b c d id group
0 0.758148 0.781987 0.310849 0.600912 1 [0, 1, 2, 3]
1 0.694848 0.755622 0.947359 0.708422 2 [0, 2, 3, 4]
2 0.515446 0.454484 0.169883 0.697287 1 [1, 1, 1, 1]
3 0.361939 0.325718 0.143510 0.077142 4 1
Output:
id a b c d group
0 1 1.273594 1.236471 0.480732 1.298199 [0, 1, 2, 3, 1, 1, 1, 1]
1 2 0.694848 0.755622 0.947359 0.708422 [0, 2, 3, 4]
2 4 0.361939 0.325718 0.143510 0.077142 1
Second Solution
We can also use GroupBy.agg with named aggregation, as follows:
df.groupby('id', as_index=False).agg(a=('a', 'sum'), b=('b', 'sum'), c=('c', 'sum'), d=('d', 'sum'), group=('group', 'sum'))
Result:
id a b c d group
0 1 1.273594 1.236471 0.480732 1.298199 [0, 1, 2, 3, 1, 1, 1, 1]
1 2 0.694848 0.755622 0.947359 0.708422 [0, 2, 3, 4]
2 4 0.361939 0.325718 0.143510 0.077142 1
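If relying on sum to concatenate lists feels too implicit, the list column can be given its own aggregation function. A minimal sketch with fixed numbers instead of random ones; it assumes every entry in group is a list (unlike the scalar 1 in the question's last row), and the helper name concat_lists is hypothetical:

```python
from itertools import chain

import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [0.5, 0.5, 0.5, 0.5],
    "id": [1, 2, 1, 4],
    "group": [[0, 1, 2, 3], [0, 2, 3, 4], [1, 1, 1, 1], [1]],
})

def concat_lists(lists):
    # Flatten the lists of one group into a single list, keeping row order.
    return list(chain.from_iterable(lists))

result = df.groupby("id", as_index=False).agg(
    a=("a", "sum"),
    b=("b", "sum"),
    group=("group", concat_lists),
)
print(result)
```

Naming the aggregation per column keeps the numeric sums and the list concatenation separate, so a non-list value in group fails loudly instead of being summed silently.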
Does this work:
pd.merge(df.groupby('id', as_index = False)[['a', 'b', 'c', 'd']].sum(), df.groupby('id')['group'].apply(sum).reset_index(), on = 'id')
id a b c d group
0 1 1.241602 0.839409 0.779673 0.639509 [0, 1, 2, 3, 1, 1, 1, 1]
1 2 0.967984 0.838906 0.313017 0.498611 [0, 2, 3, 4]
2 4 0.042871 0.367209 0.676656 0.178939 1
Let's say I have a Pandas series like so:
import pandas as pd
s = pd.Series([1, 0, 0, 1, 0, 0, 0], name='series')
How would I add a column with the row count since the last number greater than 0, like so:
pd.DataFrame({
'series': [1, 0, 0, 1, 0, 0, 0],
'row_num': [0, 1, 2, 0, 1, 2, 3]
})
Try this:
s.groupby(s.cumsum()).cumcount()
Output:
0 0
1 1
2 2
3 0
4 1
5 2
6 3
dtype: int64
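A minimal end-to-end sketch of the same idea that also builds the requested DataFrame. It groups on (s > 0).cumsum() rather than s.cumsum(), which is an assumption on my part to cover series containing negative values; for the example data both give the same groups.

```python
import pandas as pd

s = pd.Series([1, 0, 0, 1, 0, 0, 0], name="series")

# Each value greater than 0 starts a new block; cumcount numbers rows inside a block.
block_id = (s > 0).cumsum()
row_num = s.groupby(block_id).cumcount()

out = pd.DataFrame({"series": s, "row_num": row_num})
print(out)
```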
Numpy
Find the places where the series/array is greater than 0
Calculate the differences from one place to the next
Subtract those values from a sequence
i = np.flatnonzero(s)
n = len(s)
delta = np.diff(np.append(i, n))
r = np.arange(n)
r - r[i].repeat(delta)
array([0, 1, 2, 0, 1, 2, 3])
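A different NumPy technique, shown here only as a sketch: carry the position of the most recent positive value forward with np.maximum.accumulate and subtract it from the running position. The function name rows_since_last_positive is hypothetical, and rows before the first positive value are simply counted from the start of the array.

```python
import numpy as np

def rows_since_last_positive(values):
    values = np.asarray(values)
    idx = np.arange(len(values))
    # Position of the latest value > 0 seen so far (0 until one is seen).
    last_pos = np.maximum.accumulate(np.where(values > 0, idx, 0))
    return idx - last_pos

row_num = rows_since_last_positive([1, 0, 0, 1, 0, 0, 0])
print(row_num)
```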
Let's say I have a machine that sends out 4 bits every second and I want to see the number of times a certain bit signature is sent over time.
I am given an input list of lists that contains a message in bits that changes over time.
For my output I would like a list of dictionaries, one per bit pair position, with each unique bit pair as the key and the number of times it appears as the value.
Edit New Example:
For example, the following data set is a representation of that data, with the horizontal axis being bit position and the vertical axis being samples over time. So for the following example I have 4 total bits and 6 total samples.
a = [
[0, 0, 1, 1],
[0, 1, 1, 1],
[1, 1, 1, 1],
[1, 1, 1, 1],
[0, 0, 0, 0],
[1, 0, 1, 0]]
For this data set I am trying to count how many times a certain bit string occurs. The length should be able to vary, but for this example let's say I am taking 2 bits at a time.
So the first sample [0,0,1,1] would be split into [00,01,11], the second into [01,11,11], the third into [11,11,11], and so on, producing a list like so:
y = [
[00,01,11],
[01,11,11],
[11,11,11],
[11,11,11],
[00,00,00],
[10,01,10]]
From this I want to count each unique signature and produce a dictionary with keys corresponding to the signatures and values to the counts.
The dictionary would look like this:
z = [
{'00':2, '01':1, '11':2, '10':1},
{'00':1, '01':2, '11':3},
{'00':1, '11':4, '10':1}]
Finding the counts is easy if I have a list of parsed items. However, getting from the raw data to that parsed list is where I am currently having some trouble. I have an implementation, but it's essentially 3 for loops and it runs really slowly over a large dataset. Surely there is a better and more Pythonic way to go about this?
I am using numpy for some additional calculation later on in my program so I would not be against using it here.
UPDATE:
I have been looking around at other things and came to this. Not sure if this is the best solution either.
import numpy as np
a = np.array([
[0, 0, 1, 1],
[0, 1, 1, 1],
[1, 1, 1, 1]])
my_list = a.astype(str).tolist()
# Window width: how many bits each
# substring should contain
n = 2
# using list comprehension
final = [[''.join(c[i:i + n]) for i in range(len(c) - n + 1)] for c in my_list]
final = [['00', '01', '11'], ['01', '11', '11'], ['11', '11', '11']]
UPDATE 2:
I have run the following implementations and tested their speeds; here is what I came up with.
Running on the small example of 4 bits and 3 samples with a width of 2:
x = [
[0,0,1,1],
[0,1,1,1],
[1,1,1,1]]
My implementation took 0.0003 seconds
Kasrâmvd's implementation took 0.0002 seconds
Chris' implementation took 0.0002 seconds
Paul's implementation took 0.0243 seconds
However, when running against an actual dataset of 64 bits and 23,497 samples with a width of 2, I got these results:
My implementation took 1.5302 seconds
Kasrâmvd's implementation took 0.3913 seconds
Chris' Implementation took 2.0802 seconds
Paul's implementation took 0.0204 seconds
Here is an approach using convolution. Fast convolution depends on FFT and therefore computes with floats; a float64 has a 52-bit mantissa, so 53 is the maximum pattern length we can handle.
import itertools as it
import numpy as np
import scipy.signal as ss

MAX_BITS = np.finfo(float).nmant + 1

def sliding_window(data, width, return_keys=True, return_dict=True, prune_empty=True):
    n, m = data.shape
    if width > MAX_BITS:
        raise ValueError(f"max window width is {MAX_BITS}")
    patterns = ss.convolve(data, 1<<np.arange(width)[None], 'valid', 'auto').astype(int)
    patterns += np.arange(m-width+1)*(1<<width)
    cnts = np.bincount(patterns.ravel(), None, (m-width+1)*(1<<width)).reshape(m-width+1,-1)
    if return_keys or return_dict:
        keys = np.array([*map("".join, it.product(*width*("01",)))], 'S')
        if return_dict:
            dt = np.dtype([('key', f'S{width}'), ('value', int)])
            aux = np.empty(cnts.shape, dt)
            aux['value'] = cnts
            aux['key'] = keys
            if prune_empty:
                i,j = np.where(cnts)
                return [*map(dict, np.split(aux[i,j],
                                            i.searchsorted(np.arange(1,m-width+1))))]
            return [*map(dict, aux.tolist())]
        return keys, cnts
    return cnts

example = np.random.randint(0, 2, (10,10))
print(example)
print(sliding_window(example,3))
Sample run:
[[0 1 1 1 0 1 1 1 1 1]
[0 0 1 0 1 0 0 1 0 1]
[0 0 1 0 1 1 1 0 1 1]
[1 1 1 1 1 0 0 0 1 0]
[0 0 0 0 1 1 1 0 0 0]
[1 1 0 0 0 1 0 0 1 1]
[0 1 1 1 0 1 1 1 1 1]
[0 1 0 0 0 1 1 0 0 1]
[1 0 1 1 0 1 1 0 1 0]
[0 0 1 1 0 1 0 1 0 0]]
[{b'000': 1, b'001': 3, b'010': 1, b'011': 2, b'101': 1, b'110': 1, b'111': 1}, {b'000': 1, b'010': 2, b'011': 2, b'100': 2, b'111': 3}, {b'000': 2, b'001': 1, b'101': 2, b'110': 4, b'111': 1}, {b'001': 2, b'010': 1, b'011': 2, b'101': 4, b'110': 1}, {b'010': 2, b'011': 4, b'100': 2, b'111': 2}, {b'000': 1, b'001': 1, b'100': 1, b'101': 1, b'110': 4, b'111': 2}, {b'001': 2, b'010': 2, b'100': 2, b'101': 2, b'111': 2}, {b'000': 1, b'001': 1, b'010': 2, b'011': 2, b'100': 1, b'101': 1, b'111': 2}]
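For comparison, an integer-only sketch that avoids the float convolution. It is not the approach above: it encodes each window as an integer with numpy.lib.stride_tricks.sliding_window_view (which needs NumPy 1.20 or later) and counts per window position with np.bincount. The function name window_counts is hypothetical, and the data is the six-sample example from the question.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def window_counts(data, width):
    # windows has shape (n_samples, n_positions, width)
    windows = sliding_window_view(data, width, axis=1)
    # Most significant bit first, so the code 1 corresponds to '01' for width 2.
    weights = 1 << np.arange(width - 1, -1, -1)
    codes = windows @ weights
    n_codes = 1 << width
    # One row of counts per window position, one column per bit pattern.
    return np.stack([np.bincount(col, minlength=n_codes) for col in codes.T])

a = np.array([[0, 0, 1, 1],
              [0, 1, 1, 1],
              [1, 1, 1, 1],
              [1, 1, 1, 1],
              [0, 0, 0, 0],
              [1, 0, 1, 0]])
width = 2
counts = window_counts(a, width)

# Convert to the list-of-dicts form, dropping patterns that never occur.
dicts = [{format(code, f"0{width}b"): int(c) for code, c in enumerate(row) if c}
         for row in counts]
print(dicts)
```

The count table grows as 2**width columns, so this sketch only suits small window widths.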
If you wanna have a geometrical or algebraic analysis/solution you can do the following:
In [108]: x = np.array([[0,0,1,1],
...: [0,1,1,1],
...: [1,1,1,1]])
...:
In [109]: pairs = np.dstack((x[:, :-1], x[:, 1:]))
In [110]: x, y, z = pairs.shape
In [111]: uniques = np.unique(pairs.reshape(x*y, z), axis=0)
In [112]: uniques
Out[112]:
array([[0, 0],
       [0, 1],
       [1, 1]])
# Note: 3D broadcasting is not recommended in every situation; please read the docs for more details.
In [113]: R = (uniques[:,None][:,None,:] == pairs).all(3).sum(-1)
In [114]: R
Out[114]:
array([[1, 0, 0],
[1, 1, 0],
[1, 2, 3]])
Each column of R holds the counts of the unique pairs in uniques for the corresponding row of your original array.
You can then get a Python object like the one you want as follows:
In [116]: [{tuple(i): j for i,j in zip(uniques, i) if j} for i in R.T]
Out[116]: [{(0, 0): 1, (0, 1): 1, (1, 1): 1}, {(0, 1): 1, (1, 1): 2}, {(1, 1): 3}]
This solution doesn't pair the bits, but gives them as tuples (although that should be simple enough to do).
EDIT: formed strings of bits as needed.
from collections import Counter
x = [[0,0,1,1],
[0,1,1,1],
[1,1,1,1]]
y = [[''.join(map(str, ref[j:j+2])) for j in range(len(x[0])-1)]
     for ref in x]
for bit in y:
    d = Counter(bit)
    print(d)
Prints
Counter({'00': 1, '01': 1, '11': 1})
Counter({'11': 2, '01': 1})
Counter({'11': 3})
EDIT: To increase the window from 2 to 3, you might add this to your code:
window = 3
offset = window - 1
y = [[''.join(map(str, ref[j:j+window])) for j in range(len(x[0])-offset)]
     for ref in x]
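The question asks for counts per window position across all samples, whereas the loop above counts within each sample. A minimal sketch of the per-position variant, using the six-sample example from the question (the helper name count_windows is hypothetical):

```python
from collections import Counter

def count_windows(samples, width=2):
    n_positions = len(samples[0]) - width + 1
    # One counter per window position, shared across all samples.
    counters = [Counter() for _ in range(n_positions)]
    for row in samples:
        for pos in range(n_positions):
            key = "".join(str(bit) for bit in row[pos:pos + width])
            counters[pos][key] += 1
    return [dict(c) for c in counters]

a = [[0, 0, 1, 1],
     [0, 1, 1, 1],
     [1, 1, 1, 1],
     [1, 1, 1, 1],
     [0, 0, 0, 0],
     [1, 0, 1, 0]]

z = count_windows(a, width=2)
print(z)
```

This is still plain Python loops, so it is meant to pin down the expected result rather than to be fast on large inputs.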
I have the following pandas series (represented as a list):
[7,2,0,3,4,2,5,0,3,4]
I would like to define a new series that returns distance to the last zero. It means that I would like to have the following output:
[1,2,0,1,2,3,4,0,1,2]
How to do it in pandas in the most efficient way?
The complexity is O(n); what will slow it down is doing a for loop in Python. If there are k zeros in the series, and log k is negligible compared to the length of the series, an O(n log k) solution would be:
>>> izero = np.r_[-1, (ts == 0).nonzero()[0]] # indices of zeros
>>> idx = np.arange(len(ts))
>>> idx - izero[np.searchsorted(izero - 1, idx) - 1]
array([1, 2, 0, 1, 2, 3, 4, 0, 1, 2])
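A self-contained sketch of the same searchsorted idea, assuming ts is a NumPy array. It uses np.flatnonzero to locate the zeros and side='right' in place of the izero - 1 shift; both are my substitutions, not the exact code above.

```python
import numpy as np

ts = np.array([7, 2, 0, 3, 4, 2, 5, 0, 3, 4])

# Positions of the zeros, with a sentinel -1 standing in for "before the start".
izero = np.r_[-1, np.flatnonzero(ts == 0)]
idx = np.arange(len(ts))

# For each position, look up the closest zero at or before it and subtract.
dist = idx - izero[np.searchsorted(izero, idx, side="right") - 1]
print(dist)
```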
A solution in Pandas is a little bit tricky, but could look like this (s is your Series):
>>> x = (s != 0).cumsum()
>>> y = x != x.shift()
>>> y.groupby((y != y.shift()).cumsum()).cumsum()
0 1
1 2
2 0
3 1
4 2
5 3
6 4
7 0
8 1
9 2
dtype: int64
For the last step, this uses the "itertools.groupby" recipe from the Pandas cookbook.
A solution that may not be as performant (haven't really checked), but easier to understand in terms of the steps (at least for me), would be:
df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})
df
df['flag'] = np.where(df['X'] == 0, 0, 1)
df['cumsum'] = df['flag'].cumsum()
df['offset'] = df['cumsum']
df.loc[df.flag==1, 'offset'] = np.nan
df['offset'] = df['offset'].ffill().fillna(0).astype(int)
df['final'] = df['cumsum'] - df['offset']
df
It's sometimes surprising to see how simple it is to get c-like speeds for this stuff using Cython. Assuming your column's .values gives arr, then:
# typed memoryviews over 1-D data; assumes arr has a C int dtype (e.g. np.intc)
cdef int[:] arr_view = arr
ret = np.zeros_like(arr)
cdef int[:] ret_view = ret
cdef int i, zero_count = 0
for i in range(len(ret)):
    zero_count = 0 if arr_view[i] == 0 else zero_count + 1
    ret_view[i] = zero_count
Note the use of typed memory views, which are extremely fast. You can speed it up further by decorating the function that contains this with @cython.boundscheck(False).
Another option
df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})
zeros = np.r_[-1, np.where(df.X == 0)[0]]
def d0(a):
    return np.min(a[a>=0])
df.index.to_series().apply(lambda i: d0(i - zeros))
Or using pure numpy
df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})
a = np.arange(len(df))[:, None] - np.r_[-1 , np.where(df.X == 0)[0]][None]
np.min(a, where=a>=0, axis=1, initial=len(df))
Yet another way to do this is with NumPy's accumulate. The only catch is that, to initialize the counter at zero, you need to insert a zero in front of the series values.
import numpy as np
# Define Python function
f = lambda a, b: 0 if b == 0 else a + 1
# Convert to Numpy ufunc
npf = np.frompyfunc(f, 2, 1)
# Apply recursively over series values
x = npf.accumulate(np.r_[0, s.values])[1:]
print(x)
array([1, 2, 0, 1, 2, 3, 4, 0, 1, 2], dtype=object)
Here is a way without using groupby:
((v:=pd.Series([7,2,0,3,4,2,5,0,3,4]).ne(0))
.cumsum()
.where(v.eq(0)).ffill().fillna(0)
.rsub(v.cumsum())
.astype(int)
.tolist())
Output:
[1, 2, 0, 1, 2, 3, 4, 0, 1, 2]
Maybe pandas is not the best tool for this, as in the answer by @behzad.nouri; however, here is another variation:
df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})
z = df.ne(0).X
z.groupby((z != z.shift()).cumsum()).cumsum()
0 1
1 2
2 0
3 1
4 2
5 3
6 4
7 0
8 1
9 2
Name: X, dtype: int64
Solution 2:
If you write the following code you will get almost everything you need, except that the first row starts from 0 and not 1:
df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})
df.eq(0).cumsum().groupby('X').cumcount()
0 0
1 1
2 0
3 1
4 2
5 3
6 4
7 0
8 1
9 2
dtype: int64
This happens because cumcount starts counting from 0 within each group, and the first group does not begin with a zero. To get the desired result, I prepended a 0 row, calculated everything, and then dropped that row at the end:
x = pd.Series([0], index=[0])
df = pd.concat([x, df])
df.eq(0).cumsum().groupby('X').cumcount().reset_index(drop=True).drop(0).reset_index(drop=True)
0 1
1 2
2 0
3 1
4 2
5 3
6 4
7 0
8 1
9 2
dtype: int64
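A sketch of a variant that avoids prepending and dropping a row: number the rows within each block with cumcount, then add 1 only in the block before the first zero. The names block and dist are illustrative.

```python
import pandas as pd

s = pd.Series([7, 2, 0, 3, 4, 2, 5, 0, 3, 4])

# block is 0 before the first zero and increases by 1 at every zero.
block = s.eq(0).cumsum()
dist = s.groupby(block).cumcount()

# Blocks 1, 2, ... start on a zero, so their count is already the distance.
# Block 0 has no leading zero, so shift it by one.
dist = dist.where(block > 0, dist + 1)
print(dist.tolist())
```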