Create a dataset from another based on the first occurrence of some number - python

I have a dataset which looks like [3,4,5,-5,4,5,6,3,2-6,6].
I want to create a dataset that always has 0 at the indexes covered by the first run of positive numbers in the first dataset, and 1 at all remaining indexes.
So for a = [3,4,5,-5,4,5,6,3,2-6,6] it should be
b = [0,0,0,1,1,1,1,1,1,1]
How can I produce b from a using pandas and Python?

Since you tagged pandas, here is a solution using a Series:
import pandas as pd
s = pd.Series([3, 4, 5, -5, 4, 5, 6, 3, 2 - 6, 6])
# find the index of the first value that is not greater than zero
idx = (s > 0).idxmin()
# using that index, mark all positions before it as 0 and the rest as 1
res = pd.Series(s.index >= idx, dtype=int)
print(res)
Output
0 0
1 0
2 0
3 1
4 1
5 1
6 1
7 1
8 1
9 1
dtype: int64
If you prefer a one-liner:
res = pd.Series(s.index >= (s > 0).idxmin(), dtype=int)
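One caveat: if every value in s is positive, (s > 0) is all True, so idxmin() simply returns the first index and the result would be all ones instead of all zeros. A minimal guard, as a sketch:
if (s > 0).all():
    res = pd.Series(0, index=s.index)  # the whole series is the first positive run
else:
    res = pd.Series(s.index >= (s > 0).idxmin(), dtype=int)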

You can use a cummax on the boolean series:
s = pd.Series([3, 4, 5, -5, 4, 5, 6, 3, 2 - 6, 6])
out = s.lt(0).cummax().astype(int)
Output:
0 0
1 0
2 0
3 1
4 1
5 1
6 1
7 1
8 1
9 1
dtype: int64
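To see why this works, here are the intermediate steps for the sample series:
s.lt(0).tolist()
# [False, False, False, True, False, False, False, False, True, False]
s.lt(0).cummax().tolist()
# [False, False, False, True, True, True, True, True, True, True]
cummax propagates the first True (the first negative value) to every later position, and astype(int) turns the booleans into 0/1.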
If you are really working with lists, then pandas is not needed and numpy should be more efficient:
import numpy as np
a = [3,4,5,-5,4,5,6,3,2-6,6]
b = np.maximum.accumulate(np.array(a)<0).astype(int).tolist()
Output: [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
And if the list is small, pure python should be preferred:
from itertools import accumulate
b = list(accumulate((int(x<0) for x in a), max))
Output: [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

Related

Count how many consecutive rows meet a condition with pandas

I have a table like this:
import pandas as pd
df = pd.DataFrame({
"day": [1, 2, 3, 4, 5, 6],
"tmin": [-2, -3, -1, -4, -4, -2]
})
I want to create a column like this:
df['days_under_0_until_now'] = [1, 2, 3, 4, 5, 6]
df['days_under_-2_until_now'] = [1, 2, 0, 1, 2, 3]
df['days_under_-3_until_now'] = [0, 1, 0, 1, 2, 0]
So days_under_X_until_now means how many consecutive days, up to and including the current one, tmin has been less than or equal to X.
I'd like to avoid doing this with loops since the data is huge. Is there an alternative way to do it?
To improve performance, avoid groupby: compare the column's values against the list of thresholds, then use this solution for counting consecutive Trues:
import numpy as np

vals = [0, -2, -3]
arr = df['tmin'].to_numpy()[:, None] <= np.array(vals)[None, :]
cols = [f'days_under_{v}_until_now' for v in vals]
df1 = pd.DataFrame(arr, columns=cols, index=df.index)
b = df1.cumsum()
df = df.join(b.sub(b.mask(df1).ffill().fillna(0)).astype(int))
print(df)
   day  tmin  days_under_0_until_now  days_under_-2_until_now  days_under_-3_until_now
0    1    -2                       1                        1                        0
1    2    -3                       2                        2                        1
2    3    -1                       3                        0                        0
3    4    -4                       4                        1                        1
4    5    -4                       5                        2                        2
5    6    -2                       6                        3                        0
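At the heart of this is the consecutive-True counting trick: the cumulative sum counts all Trues so far, and subtracting the count frozen at the most recent False leaves the length of the current run. A minimal single-column sketch with illustrative values:
import pandas as pd
m = pd.Series([True, True, False, True, True, True])
c = m.cumsum()                      # 1 2 2 3 4 5
last = c.mask(m).ffill().fillna(0)  # count as of the last False: 0 0 2 2 2 2
print((c - last).astype(int).tolist())  # [1, 2, 0, 1, 2, 3]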

How to apply different functions to different columns after groupby, like sum and .apply(list)? (Python)

I have a dataframe where I want to group rows based on a column. Some of the columns in the rows I want to sum up and the others I want to aggregate as a list.
#creating sample data
df = pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))
df['id'] = [1,2,1,4]
df['group'] = [[0,1,2,3] , [0,2,3,4], [1,1,1,1], 1]
df
Out[5]:
          a         b         c         d  id         group
0  0.850058  0.160497  0.742296  0.354296   1  [0, 1, 2, 3]
1  0.598759  0.399200  0.799157  0.908174   2  [0, 2, 3, 4]
2  0.160764  0.671702  0.414800  0.429992   1  [1, 1, 1, 1]
3  0.011089  0.581518  0.718829  0.610140   4             1
Here I want to combine row 0 and row 2 as they have the same id. When doing this, I want to sum up the values in columns a, b, c and d but for column group, I want the lists to be appended. How can I do this?
My expected output is:
          a         b         c         d  id                     group
0  1.010822  0.832199  1.157096  0.784288   1  [0, 1, 2, 3, 1, 1, 1, 1]
1  0.598759  0.399200  0.799157  0.908174   2              [0, 2, 3, 4]
2  0.011089  0.581518  0.718829  0.610140   4                         1
(When I use only the sum or df.groupby(['id'])['group'].apply(list), the other columns are dropped.)
Use groupby.aggregate
df.groupby('id').agg({k: sum for k in ['a', 'b', 'c', 'd', 'group']})
A one-liner alternative would be using the numeric_only flag. But be careful with the columns you are feeding in.
df.groupby('id').sum(numeric_only=False)
Output
           a         b         c         d                     group
id
1   1.488778  0.802794  0.949768  0.952676  [0, 1, 2, 3, 1, 1, 1, 1]
2   0.488390  0.512301  0.064922  0.233875              [0, 2, 3, 4]
4   0.649945  0.267125  0.229313  0.156696                         1
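The reason sum also handles the group column is that summing object values reduces them with +, and + concatenates Python lists:
print([0, 1, 2, 3] + [1, 1, 1, 1])  # [0, 1, 2, 3, 1, 1, 1, 1]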
First Solution:
We can do this in 2 steps: the 1st step uses GroupBy.sum to get the grouped sum of the first 4 columns; the 2nd step acts on the group column only, concatenating the lists, also via GroupBy.sum.
df.groupby('id').sum().join(df.groupby('id')['group'].sum()).reset_index()
Input (Different values owing to the different random numbers generated)
          a         b         c         d  id         group
0  0.758148  0.781987  0.310849  0.600912   1  [0, 1, 2, 3]
1  0.694848  0.755622  0.947359  0.708422   2  [0, 2, 3, 4]
2  0.515446  0.454484  0.169883  0.697287   1  [1, 1, 1, 1]
3  0.361939  0.325718  0.143510  0.077142   4             1
Output:
   id         a         b         c         d                     group
0   1  1.273594  1.236471  0.480732  1.298199  [0, 1, 2, 3, 1, 1, 1, 1]
1   2  0.694848  0.755622  0.947359  0.708422              [0, 2, 3, 4]
2   4  0.361939  0.325718  0.143510  0.077142                         1
Second Solution
We can also use GroupBy.agg with named aggregation, as follows:
df.groupby('id', as_index=False).agg(a=('a', 'sum'), b=('b', 'sum'), c=('c', 'sum'), d=('d', 'sum'), group=('group', 'sum'))
Result:
   id         a         b         c         d                     group
0   1  1.273594  1.236471  0.480732  1.298199  [0, 1, 2, 3, 1, 1, 1, 1]
1   2  0.694848  0.755622  0.947359  0.708422              [0, 2, 3, 4]
2   4  0.361939  0.325718  0.143510  0.077142                         1
Does this work:
pd.merge(df.groupby('id', as_index = False).sum(), df.groupby('id')['group'].apply(sum).reset_index(), on = 'id')
   id         a         b         c         d                     group
0   1  1.241602  0.839409  0.779673  0.639509  [0, 1, 2, 3, 1, 1, 1, 1]
1   2  0.967984  0.838906  0.313017  0.498611              [0, 2, 3, 4]
2   4  0.042871  0.367209  0.676656  0.178939                         1

Delete rows from a data frame w.r.t. columns of another data frame

I have a data frame, say df1, with a MULTILEVEL INDEX:
     A  B   C   D
0 0  0  1   2   3
     4  5   6   7
1 2  8  9  10  11
  3  2  3   4   5
and I have another data frame, df2, with 2 columns (B and C) in common, also with a MULTILEVEL INDEX:
     X  B  C   Y
0 0  0  0  7   3
  1  4  5  6   7
1 2  8  2  3  11
  3  2  3  4   5
I need to remove the rows from df1 where the values of columns B and C are the same as in df2, so I should get something like this:
     A  B   C   D
0 0  0  1   2   3
1 2  8  9  10  11
I have tried to do this by getting the index of the common elements and then removing them via a list, but the results are all messed up and in multi-level form.
You can do this in a one-liner using pandas.DataFrame.iloc, numpy.where and numpy.logical_or like this (I find it to be the simplest way). Note it compares the frames row by row, so it assumes df1 and df2 have the same length and aligned indexes:
df1 = df1.iloc[np.where(np.logical_or(df1['B'] != df2['B'], df1['C'] != df2['C']))]
of course don't forget to:
import numpy as np
output:
     A  B   C   D
0 0  0  1   2   3
1 2  8  9  10  11
Hope this was helpful. If there are any questions or remarks please feel free to comment.
You could make MultiIndexes out of the B and C columns, and then call the index's isin method:
idx1 = pd.MultiIndex.from_arrays([df1['B'],df1['C']])
idx2 = pd.MultiIndex.from_arrays([df2['B'],df2['C']])
mask = idx1.isin(idx2)
result = df1.loc[~mask]
For example,
import pandas as pd
df1 = pd.DataFrame({'A': [0, 4, 8, 2], 'B': [1, 5, 9, 3], 'C': [2, 6, 10, 4], 'D': [3, 7, 11, 5], 'P': [0, 0, 1, 1], 'Q': [0, 0, 2, 3]})
df1 = df1.set_index(list('PQ'))
df1.index.names = [None,None]
df2 = pd.DataFrame({'B': [0, 5, 2, 3], 'C': [7, 6, 3, 4], 'P': [0, 0, 1, 1], 'Q': [0, 1, 2, 3], 'X': [0, 4, 8, 2], 'Y': [3, 7, 11, 5]})
df2 = df2.set_index(list('PQ'))
df2.index.names = [None,None]
idx1 = pd.MultiIndex.from_arrays([df1['B'],df1['C']])
idx2 = pd.MultiIndex.from_arrays([df2['B'],df2['C']])
mask = idx1.isin(idx2)
result = df1.loc[~mask]
print(result)
yields
     A  B   C   D
0 0  0  1   2   3
1 2  8  9  10  11
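For completeness, a sketch of another common pattern for this kind of filtering, an anti-join via merge with indicator=True (assuming, as above, that B and C are ordinary columns):
mask = pd.merge(df1[['B', 'C']], df2[['B', 'C']].drop_duplicates(),
                on=['B', 'C'], how='left', indicator=True)['_merge'].eq('both')
result = df1.loc[~mask.to_numpy()]  # merge resets the index, so mask positionally
Like the isin approach, and unlike the element-wise comparison in the first answer, this matches (B, C) pairs anywhere in df2.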

How to replace only the first n elements in a numpy array that are larger than a certain value?

I have an array myA like this:
array([ 7, 4, 5, 8, 3, 10])
If I want to replace all values that are larger than a value val by 0, I can simply do:
myA[myA > val] = 0
which gives me the desired output (for val = 5):
array([0, 4, 5, 0, 3, 0])
However, my goal is to replace not all but only the first n elements of this array that are larger than a value val.
So, if n = 2 my desired outcome would look like this (10 is the third element larger than val and should therefore not be replaced):
array([ 0, 4, 5, 0, 3, 10])
A straightforward implementation would be:
import numpy as np
myA = np.array([7, 4, 5, 8, 3, 10])
n = 2
val = 5
# track the number of replacements
repl = 0
for ind, vali in enumerate(myA):
    if vali > val:
        myA[ind] = 0
        repl += 1
        if repl == n:
            break
That works, but maybe someone can come up with a smart way of masking!?
The following should work:
myA[(myA > val).nonzero()[0][:2]] = 0
since nonzero will return the indexes where the boolean array myA > val is non-zero, i.e. True.
For example:
In [1]: myA = np.array([ 7, 4, 5, 8, 3, 10])
In [2]: myA[(myA > 5).nonzero()[0][:2]] = 0
In [3]: myA
Out[3]: array([ 0, 4, 5, 0, 3, 10])
Final solution is very simple:
import numpy as np
myA = np.array([7, 4, 5, 8, 3, 10])
n = 2
val = 5
myA[np.where(myA > val)[0][:n]] = 0
print(myA)
Output:
[ 0 4 5 0 3 10]
Here's another possibility (untested), probably no better than nonzero:
def truncate_mask(m, stop):
    m = m.astype(bool, copy=False)  # if we allow non-bool m, the next line becomes nonsense
    return m & (np.cumsum(m) <= stop)

myA[truncate_mask(myA > val, n)] = 0
By avoiding building and using an explicit index you might end up with slightly better performance...but you'd have to test it to find out.
Edit 1: while we're on the subject of possibilities, you could also try:
def truncate_mask(m, stop):
    m = m.astype(bool, copy=True)  # note we need to copy m here to safely modify it
    # side='right' so the stop-th True itself is kept; side='left' would cut it off
    m[np.searchsorted(np.cumsum(m), stop, side='right'):] = False
    return m
Edit 2 (the next day): I've just tested this and it seems that cumsum is actually worse than nonzero, at least with the kinds of values I was using (so neither of the above approaches is worth using). Out of curiosity, I also tried it with numba:
import numba
@numba.jit
def set_first_n_gt_thresh(a, val, thresh, n):
    ii = 0
    while n > 0 and ii < len(a):
        if a[ii] > thresh:
            a[ii] = val
            n -= 1
        ii += 1
This only iterates over the array once, or rather it only iterates over the necessary part of the array once, never even touching the latter part. This gives you vastly superior performance for small n, but even for the worst case of n>=len(a) this approach is faster.
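A quick usage sketch of the numba version; it modifies the array in place, with val the replacement value and thresh the cutoff:
import numpy as np
a = np.array([7, 4, 5, 8, 3, 10])
set_first_n_gt_thresh(a, 0, 5, 2)
print(a)  # [ 0  4  5  0  3 10]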
You could use the same solution as here by converting your np.array to a pd.Series:
s = pd.Series([ 7, 4, 5, 8, 3, 10])
n = 2
m = 5
s[s[s>m].iloc[:n].index] = 0
In [416]: s
Out[416]:
0 0
1 4
2 5
3 0
4 3
5 10
dtype: int64
Step by step explanation:
In [426]: s > m
Out[426]:
0 True
1 False
2 False
3 True
4 False
5 True
dtype: bool
In [428]: s[s>m].iloc[:n]
Out[428]:
0 7
3 8
dtype: int64
In [429]: s[s>m].iloc[:n].index
Out[429]: Int64Index([0, 3], dtype='int64')
In [430]: s[s[s>m].iloc[:n].index]
Out[430]:
0 7
3 8
dtype: int64
The output in In[430] looks the same as in In[428], but In[428] is a copy while In[430] is the original series.
If you need an np.array back, you can use the values attribute:
In [418]: s.values
Out[418]: array([ 0, 4, 5, 0, 3, 10], dtype=int64)

How to count distance to the previous zero in pandas series?

I have the following pandas series (represented as a list):
[7,2,0,3,4,2,5,0,3,4]
I would like to define a new series that returns distance to the last zero. It means that I would like to have the following output:
[1,2,0,1,2,3,4,0,1,2]
How to do it in pandas in the most efficient way?
The complexity is O(n). What will slow it down is doing a for loop in Python. If there are k zeros in the series, and log k is negligible compared to the length of the series, an O(n log k) solution would be:
>>> izero = np.r_[-1, (ts == 0).nonzero()[0]] # indices of zeros
>>> idx = np.arange(len(ts))
>>> idx - izero[np.searchsorted(izero - 1, idx) - 1]
array([1, 2, 0, 1, 2, 3, 4, 0, 1, 2])
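To make the searchsorted step concrete, here is a trace on the example; the -1 in izero is a sentinel standing in for a zero just before the start of the series:
import numpy as np
ts = np.array([7, 2, 0, 3, 4, 2, 5, 0, 3, 4])
izero = np.r_[-1, (ts == 0).nonzero()[0]]  # array([-1, 2, 7])
idx = np.arange(len(ts))
last_zero = izero[np.searchsorted(izero - 1, idx) - 1]  # most recent zero at or before each position
print(last_zero)        # [-1 -1  2  2  2  2  2  7  7  7]
print(idx - last_zero)  # [ 1  2  0  1  2  3  4  0  1  2]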
A solution in Pandas is a little bit tricky, but could look like this (s is your Series):
>>> x = (s != 0).cumsum()
>>> y = x != x.shift()
>>> y.groupby((y != y.shift()).cumsum()).cumsum()
0 1
1 2
2 0
3 1
4 2
5 3
6 4
7 0
8 1
9 2
dtype: int64
For the last step, this uses the "itertools.groupby" recipe from the Pandas cookbook.
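Printing the intermediates for the sample series may help in following the trick:
s = pd.Series([7, 2, 0, 3, 4, 2, 5, 0, 3, 4])
x = (s != 0).cumsum()  # 1 2 2 3 4 5 6 6 7 8 -- pauses at each zero
y = x != x.shift()     # False exactly at the zeros (where x repeats)
# grouping on (y != y.shift()).cumsum() isolates each run,
# and the final cumsum counts Trues within the run, restarting after every zero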
A solution that may not be as performant (haven't really checked), but easier to understand in terms of the steps (at least for me), would be:
import numpy as np
import pandas as pd

df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})
df
df['flag'] = np.where(df['X'] == 0, 0, 1)
df['cumsum'] = df['flag'].cumsum()
df['offset'] = df['cumsum']
df.loc[df.flag==1, 'offset'] = np.nan
df['offset'] = df['offset'].fillna(method='ffill').fillna(0).astype(int)
df['final'] = df['cumsum'] - df['offset']
df
It's sometimes surprising to see how simple it is to get C-like speeds for this stuff using Cython. Assuming your column's .values gives arr (a 1-d int array), then:
cdef int[:] arr_view = arr
ret = np.zeros_like(arr)
cdef int[:] ret_view = ret
cdef int i, zero_count = 0
for i in range(len(ret)):
    zero_count = 0 if arr_view[i] == 0 else zero_count + 1
    ret_view[i] = zero_count
Note the use of typed memory views, which are extremely fast. You can speed this up further with @cython.boundscheck(False) decorating a function that uses it.
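Wrapped up as a complete function, it might look like the following sketch (the function name is illustrative, and it assumes the column holds 32-bit ints):
# distance.pyx -- compile with Cython
import numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def dist_to_last_zero(int[:] arr_view):
    ret = np.zeros(arr_view.shape[0], dtype=np.int32)
    cdef int[:] ret_view = ret
    cdef Py_ssize_t i
    cdef int zero_count = 0
    for i in range(arr_view.shape[0]):
        zero_count = 0 if arr_view[i] == 0 else zero_count + 1
        ret_view[i] = zero_count
    return ret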
Another option
df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})
zeros = np.r_[-1, np.where(df.X == 0)[0]]
def d0(a):
    return np.min(a[a >= 0])

df.index.to_series().apply(lambda i: d0(i - zeros))
Or using pure numpy
df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})
a = np.arange(len(df))[:, None] - np.r_[-1 , np.where(df.X == 0)[0]][None]
np.min(a, where=a>=0, axis=1, initial=len(df))
Yet another way to do this is using Numpy's accumulate. The only catch is that, to initialize the counter at zero, you need to insert a zero in front of the series values.
import numpy as np
import pandas as pd

s = pd.Series([7, 2, 0, 3, 4, 2, 5, 0, 3, 4])
# Define a Python function for one step of the recurrence
f = lambda a, b: 0 if b == 0 else a + 1
# Convert it to a Numpy ufunc
npf = np.frompyfunc(f, 2, 1)
# Apply it cumulatively over the series values (dropping the leading 0 afterwards)
x = npf.accumulate(np.r_[0, s.values])[1:]
print(x)
print(x)
array([1, 2, 0, 1, 2, 3, 4, 0, 1, 2], dtype=object)
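Note the dtype=object in the result, which frompyfunc always produces; cast back if you need a numeric array:
x = x.astype(int)  # array([1, 2, 0, 1, 2, 3, 4, 0, 1, 2])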
Here is a way without using groupby:
((v:=pd.Series([7,2,0,3,4,2,5,0,3,4]).ne(0))
.cumsum()
.where(v.eq(0)).ffill().fillna(0)
.rsub(v.cumsum())
.astype(int)
.tolist())
Output:
[1, 2, 0, 1, 2, 3, 4, 0, 1, 2]
Maybe pandas is not the best tool for this, as in the answer by @behzad.nouri; however, here is another variation:
df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})
z = df.ne(0).X
z.groupby((z != z.shift()).cumsum()).cumsum()
0 1
1 2
2 0
3 1
4 2
5 3
6 4
7 0
8 1
9 2
Name: X, dtype: int64
Solution 2:
If you write the following code you will get almost everything you need, except that the count in the first rows starts from 0 and not 1:
df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})
df.eq(0).cumsum().groupby('X').cumcount()
0 0
1 1
2 0
3 1
4 2
5 3
6 4
7 0
8 1
9 2
dtype: int64
This happens because, unlike later runs, the first run has no zero in front of it to occupy the 0 slot of cumcount. To get the desired result, I prepended a 0, calculated everything, and then dropped the extra row at the end to get:
x = pd.Series([0], index=[0])
df = pd.concat([x, df])
df.eq(0).cumsum().groupby('X').cumcount().reset_index(drop=True).drop(0).reset_index(drop=True)
0 1
1 2
2 0
3 1
4 2
5 3
6 4
7 0
8 1
9 2
dtype: int64
