I have a dataset that looks like [3,4,5,-5,4,5,6,3,2-6,6]
I want to create a second dataset that has 0 at every index belonging to the initial run of positive numbers in the first dataset, and 1 at every remaining index.
So for a = [3,4,5,-5,4,5,6,3,2-6,6] it should be
b = [0,0,0, 1,1,1,1,1,1,1]
How can I produce b from a using pandas and Python?
Since you tagged pandas, here is a solution using a Series:
import pandas as pd
s = pd.Series([3, 4, 5, -5, 4, 5, 6, 3, 2 - 6, 6])
# find the first index whose value is not greater than zero
idx = (s > 0).idxmin()
# set every position before that index to 0 and every position from it onward to 1
res = pd.Series(s.index >= idx, dtype=int)
print(res)
Output
0 0
1 0
2 0
3 1
4 1
5 1
6 1
7 1
8 1
9 1
dtype: int64
If you prefer a one-liner:
res = pd.Series(s.index >= (s > 0).idxmin(), dtype=int)
You can use a cummax on the boolean series:
s = pd.Series([3, 4, 5, -5, 4, 5, 6, 3, 2 - 6, 6])
out = s.lt(0).cummax().astype(int)
Output:
0 0
1 0
2 0
3 1
4 1
5 1
6 1
7 1
8 1
9 1
dtype: int64
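The two approaches above are not identical in every case. As a small sketch (the sample series all_positive is made up for illustration): when every value is positive, (s > 0).idxmin() falls back to the first label, so the index-based version marks everything as 1, while the cummax version never flips and stays at 0.

```python
import pandas as pd

# hypothetical input with no negative value at all
all_positive = pd.Series([3, 4, 5])

# index-based approach: idxmin() of an all-True mask returns the first label
idx = (all_positive > 0).idxmin()
via_idxmin = pd.Series(all_positive.index >= idx, dtype=int)

# cummax approach: the mask never becomes True, so the result stays 0
via_cummax = all_positive.lt(0).cummax().astype(int)

print(via_idxmin.tolist())
print(via_cummax.tolist())
```

If the input may contain no negative value at all, the cummax version is probably the safer choice.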
If you are really working with lists, then pandas is not needed and numpy should be more efficient:
import numpy as np
a = [3,4,5,-5,4,5,6,3,2-6,6]
b = np.maximum.accumulate(np.array(a)<0).astype(int).tolist()
Output: [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
And if the list is small, pure python should be preferred:
from itertools import accumulate
b = list(accumulate((int(x<0) for x in a), max))
Output: [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
Based on the following sample data, the following data frame is built:
day = [1, 2, 3, 2, 3, 1, 2]
item_id = [1, 1, 1, 2, 2, 3, 3]
item_name = ['A', 'A', 'A', 'B', 'B', 'C', 'C']
increase = [4, 0, 4, 3, 3, 3, 3]
decrease = [2, 2, 2, 1, 1, 1, 1]
my_df = pd.DataFrame(list(zip(day, item_id, item_name, increase, decrease)),
columns=['day', 'item_id', 'item_name', 'increase', 'decrease'])
my_df = my_df.set_index(['item_id', 'item_name'])
I would like to create two new columns:
starting_quantity[0] would have each initial value set to 0 for the index (or multi-index)
ending_quantity adds the increase and subtracts the decrease
starting_quantity[1, 2, 3, ...] is equal to the ending_quantity of the previous day.
The output I'd like to create is as follows:
I would appreciate help with any or all of the three steps above!
Try:
my_df = my_df.set_index(["item_id", "item_name"])  # skip this line if the index is already set, as in the question
g = my_df.groupby(level=0)
my_df["tmp"] = my_df["increase"] - my_df["decrease"]
my_df["starting_quantity"] = g["tmp"].shift().fillna(0)
my_df["starting_quantity"] = g["starting_quantity"].cumsum().astype(int)
my_df["ending_quantity"] = g["tmp"].cumsum()
my_df = my_df.drop(columns="tmp")
print(my_df)
Prints:
day increase decrease starting_quantity ending_quantity
item_id item_name
1 A 1 4 2 0 2
A 2 0 2 2 0
A 3 4 2 0 2
2 B 2 3 1 0 2
B 3 3 1 2 4
3 C 1 3 1 0 2
C 2 3 1 2 4
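An equivalent way to look at it, sketched here with the same sample data: the ending quantity is the per-item running total of increase minus decrease, and the starting quantity is that total before the current day's change is applied. The variable name net is only used for illustration, and the rows are assumed to be sorted by day within each item.

```python
import pandas as pd

my_df = pd.DataFrame(
    {
        "day": [1, 2, 3, 2, 3, 1, 2],
        "item_id": [1, 1, 1, 2, 2, 3, 3],
        "item_name": ["A", "A", "A", "B", "B", "C", "C"],
        "increase": [4, 0, 4, 3, 3, 3, 3],
        "decrease": [2, 2, 2, 1, 1, 1, 1],
    }
).set_index(["item_id", "item_name"])

# net change per day (helper, not part of the requested output)
net = my_df["increase"] - my_df["decrease"]

# running total of the net change within each item_id
my_df["ending_quantity"] = net.groupby(level="item_id").cumsum()
# the day's starting value is the ending value minus that day's net change
my_df["starting_quantity"] = my_df["ending_quantity"] - net

print(my_df)
```

This avoids the temporary column and the shift, at the cost of computing the columns in the opposite order.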
I am trying to obtain the second-largest value for each row in a data frame, but I am getting a ValueError:
column = [col for col in dataframe.columns if '%' in col]
dataframe["Max_2nd"] = dataframe[column].apply(lambda row: row.nlargest(2).values[-1],axis=1)
How can I resolve this?
So, given the following toy dataframe:
import pandas as pd
df = pd.DataFrame(
(
{
"col1": [1, 2, 3, 4, 5, 6, 7, 8],
"col2": [3, 0, 1, 3, 0, 1, 8, 5],
"col3": [7, 9, 2, 6, 7, 8, 0, 1],
"col4": [0, 4, 5, 0, 4, 3, 4, 0],
}
)
)
You can find the second max value in each row like this:
df["Max_2nd"] = df.apply(lambda x: sorted(x, reverse=True)[1], axis=1)
print(df)
# Outputs
col1 col2 col3 col4 Max_2nd
0 1 3 7 0 3
1 2 0 9 4 4
2 3 1 2 5 3
3 4 3 6 0 4
4 5 0 7 4 5
5 6 1 8 3 6
6 7 8 0 4 7
7 8 5 1 0 5
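If the frame is purely numeric, the same result can be sketched with NumPy by sorting each row and taking the second-to-last entry. Like the sorted() version, this counts duplicates, so a row whose maximum appears twice returns that maximum.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "col1": [1, 2, 3, 4, 5, 6, 7, 8],
        "col2": [3, 0, 1, 3, 0, 1, 8, 5],
        "col3": [7, 9, 2, 6, 7, 8, 0, 1],
        "col4": [0, 4, 5, 0, 4, 3, 4, 0],
    }
)

# sort each row in ascending order, then pick the second-largest entry
second_max = np.sort(df.to_numpy(), axis=1)[:, -2]
df["Max_2nd"] = second_max
print(df)
```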
I have a data frame, say df1, with a MULTILEVEL INDEX:
A B C D
0 0 0 1 2 3
4 5 6 7
1 2 8 9 10 11
3 2 3 4 5
and I have another data frame, df2, which shares two columns (B and C) with df1 and also has a MULTILEVEL INDEX:
X B C Y
0 0 0 0 7 3
1 4 5 6 7
1 2 8 2 3 11
3 2 3 4 5
I need to remove the rows from df1 where the values of column B and C are the same as in df2, so I should be getting something like this:
A B C D
0 0 0 1 2 3
0 2 8 9 10 11
I have tried to do this by collecting the index of the common elements in a list and then removing those rows, but the index values come back in multi-level form and I cannot get it to work.
You can do this in one line using DataFrame.iloc, numpy.where and numpy.logical_or (I find it to be the simplest way). Note that it compares df1 and df2 row by row, so both frames need the same rows in the same order:
df1 = df1.iloc[np.where(np.logical_or(df1['B']!=df2['B'],df1['C']!=df2['C']))]
of course don't forget to:
import numpy as np
output:
A B C D
0 0 0 1 2 3
1 2 8 9 10 11
Hope this was helpful. If there are any questions or remarks please feel free to comment.
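The comparison df1['B'] != df2['B'] works on labelled Series, and pandas generally refuses to compare Series whose index labels differ. A hedged sketch of a purely positional variant, using sample frames modelled on the question (the index values are assumptions):

```python
import pandas as pd

df1 = pd.DataFrame(
    {"A": [0, 4, 8, 2], "B": [1, 5, 9, 3], "C": [2, 6, 10, 4], "D": [3, 7, 11, 5]},
    index=pd.MultiIndex.from_arrays([[0, 0, 1, 1], [0, 0, 2, 3]]),
)
df2 = pd.DataFrame(
    {"X": [0, 4, 8, 2], "B": [0, 5, 2, 3], "C": [7, 6, 3, 4], "Y": [3, 7, 11, 5]},
    index=pd.MultiIndex.from_arrays([[0, 0, 1, 1], [0, 1, 2, 3]]),
)

# compare the raw values row by row, ignoring the index labels
differs = (df1["B"].to_numpy() != df2["B"].to_numpy()) | (
    df1["C"].to_numpy() != df2["C"].to_numpy()
)
result = df1[differs]
print(result)
```

This still only removes a row when the matching (B, C) pair sits at the same position in df2; it does not search the whole of df2.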
You could make MultiIndexes out of the B and C columns, and then call the index's isin method:
idx1 = pd.MultiIndex.from_arrays([df1['B'],df1['C']])
idx2 = pd.MultiIndex.from_arrays([df2['B'],df2['C']])
mask = idx1.isin(idx2)
result = df1.loc[~mask]
For example,
import pandas as pd
df1 = pd.DataFrame({'A': [0, 4, 8, 2], 'B': [1, 5, 9, 3], 'C': [2, 6, 10, 4], 'D': [3, 7, 11, 5], 'P': [0, 0, 1, 1], 'Q': [0, 0, 2, 3]})
df1 = df1.set_index(list('PQ'))
df1.index.names = [None,None]
df2 = pd.DataFrame({'B': [0, 5, 2, 3], 'C': [7, 6, 3, 4], 'P': [0, 0, 1, 1], 'Q': [0, 1, 2, 3], 'X': [0, 4, 8, 2], 'Y': [3, 7, 11, 5]})
df2 = df2.set_index(list('PQ'))
df2.index.names = [None,None]
idx1 = pd.MultiIndex.from_arrays([df1['B'],df1['C']])
idx2 = pd.MultiIndex.from_arrays([df2['B'],df2['C']])
mask = idx1.isin(idx2)
result = df1.loc[~mask]
print(result)
yields
A B C D
0 0 0 1 2 3
1 2 8 9 10 11
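Another way to express the same "remove rows whose (B, C) pair occurs anywhere in df2" idea is an anti-join with merge and its indicator column. This is only a sketch on the same sample frames; the names pairs, merged and keep are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame(
    {"A": [0, 4, 8, 2], "B": [1, 5, 9, 3], "C": [2, 6, 10, 4], "D": [3, 7, 11, 5]},
    index=pd.MultiIndex.from_arrays([[0, 0, 1, 1], [0, 0, 2, 3]]),
)
df2 = pd.DataFrame(
    {"X": [0, 4, 8, 2], "B": [0, 5, 2, 3], "C": [7, 6, 3, 4], "Y": [3, 7, 11, 5]},
    index=pd.MultiIndex.from_arrays([[0, 0, 1, 1], [0, 1, 2, 3]]),
)

# distinct (B, C) pairs of df2, so the left join cannot duplicate df1 rows
pairs = df2[["B", "C"]].drop_duplicates()
# the indicator column records whether each df1 row found a partner in df2
merged = df1.merge(pairs, on=["B", "C"], how="left", indicator=True)

# merge() discards df1's index, so the mask is applied by position
keep = (merged["_merge"] == "left_only").to_numpy()
result = df1[keep]
print(result)
```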
I have the following pandas series (represented as a list):
[7,2,0,3,4,2,5,0,3,4]
I would like to define a new series that returns distance to the last zero. It means that I would like to have the following output:
[1,2,0,1,2,3,4,0,1,2]
How to do it in pandas in the most efficient way?
The complexity is O(n). What slows it down is running a for loop in Python. If there are k zeros in the series and log k is negligible compared to the length of the series, an O(n log k) solution would be the following (ts is assumed here to hold the values as a NumPy array):
>>> izero = np.r_[-1, (ts == 0).nonzero()[0]] # indices of zeros
>>> idx = np.arange(len(ts))
>>> idx - izero[np.searchsorted(izero - 1, idx) - 1]
array([1, 2, 0, 1, 2, 3, 4, 0, 1, 2])
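The same "index of the most recent zero" idea can also be sketched with a running maximum instead of searchsorted. This is an alternative illustration, not part of the answer above; ts is again assumed to be a NumPy array.

```python
import numpy as np

ts = np.array([7, 2, 0, 3, 4, 2, 5, 0, 3, 4])
idx = np.arange(len(ts))

# position of each zero, and -1 where the value is non-zero
zero_pos = np.where(ts == 0, idx, -1)
# running maximum = index of the most recent zero seen so far (-1 if none yet)
last_zero = np.maximum.accumulate(zero_pos)

dist = idx - last_zero
print(dist.tolist())  # [1, 2, 0, 1, 2, 3, 4, 0, 1, 2]
```

Using -1 as the placeholder means the leading values are counted as if a zero sat just before the start of the series, which matches the expected output in the question.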
A solution in Pandas is a little bit tricky, but could look like this (s is your Series):
>>> x = (s != 0).cumsum()
>>> y = x != x.shift()
>>> y.groupby((y != y.shift()).cumsum()).cumsum()
0 1
1 2
2 0
3 1
4 2
5 3
6 4
7 0
8 1
9 2
dtype: int64
For the last step, this uses the "itertools.groupby" recipe from the Pandas cookbook.
A solution that may not be as performant (haven't really checked), but easier to understand in terms of the steps (at least for me), would be:
df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})
df
df['flag'] = np.where(df['X'] == 0, 0, 1)
df['cumsum'] = df['flag'].cumsum()
df['offset'] = df['cumsum']
df.loc[df.flag==1, 'offset'] = np.nan
df['offset'] = df['offset'].ffill().fillna(0).astype(int)  # fillna(method='ffill') is deprecated in recent pandas
df['final'] = df['cumsum'] - df['offset']
df
It's sometimes surprising how simple it is to get C-like speeds for this kind of thing using Cython. Assuming your column's .values gives arr, then:
# arr is assumed to be a 1-D array of C ints (e.g. dtype np.intc)
cdef int[:] arr_view = arr
ret = np.zeros_like(arr)
cdef int[:] ret_view = ret
cdef int i, zero_count = 0
for i in range(len(ret)):
    zero_count = 0 if arr_view[i] == 0 else zero_count + 1
    ret_view[i] = zero_count
Note the use of typed memory views, which are extremely fast. You can speed it up further by decorating the function that contains this code with @cython.boundscheck(False).
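For checking results, the counter logic of that loop can be written as a plain-Python reference. This is a sketch only; the function name distance_to_last_zero is made up here, and it will be much slower than the Cython version on large inputs.

```python
def distance_to_last_zero(values):
    """Count how many steps have passed since the last zero (hypothetical helper)."""
    result = []
    zero_count = 0
    for v in values:
        # reset on a zero, otherwise move one step further away from it
        zero_count = 0 if v == 0 else zero_count + 1
        result.append(zero_count)
    return result


print(distance_to_last_zero([7, 2, 0, 3, 4, 2, 5, 0, 3, 4]))
```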
Another option
df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})
zeros = np.r_[-1, np.where(df.X == 0)[0]]
def d0(a):
    return np.min(a[a>=0])
df.index.to_series().apply(lambda i: d0(i - zeros))
Or using pure numpy
df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})
a = np.arange(len(df))[:, None] - np.r_[-1 , np.where(df.X == 0)[0]][None]
np.min(a, where=a>=0, axis=1, initial=len(df))
Yet another way to do this uses NumPy's accumulate. The only catch is that, to initialize the counter at zero, you need to insert a zero in front of the series values.
import numpy as np
# Define Python function
f = lambda a, b: 0 if b == 0 else a + 1
# Convert to Numpy ufunc
npf = np.frompyfunc(f, 2, 1)
# Apply recursively over series values
x = npf.accumulate(np.r_[0, s.values], dtype=object)[1:]  # frompyfunc ufuncs operate on object arrays
print(x)
array([1, 2, 0, 1, 2, 3, 4, 0, 1, 2], dtype=object)
Here is a way without using groupby:
((v:=pd.Series([7,2,0,3,4,2,5,0,3,4]).ne(0))
.cumsum()
.where(v.eq(0)).ffill().fillna(0)
.rsub(v.cumsum())
.astype(int)
.tolist())
Output:
[1, 2, 0, 1, 2, 3, 4, 0, 1, 2]
Maybe pandas is not the best tool for this, as the answer by @behzad.nouri suggests; however, here is another variation:
df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})
z = df.ne(0).X
z.groupby((z != z.shift()).cumsum()).cumsum()
0 1
1 2
2 0
3 1
4 2
5 3
6 4
7 0
8 1
9 2
Name: X, dtype: int64
Solution 2:
The following code gets you almost everything you need, except that the first rows start counting from 0 instead of 1:
df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})
df.eq(0).cumsum().groupby('X').cumcount()
0 0
1 1
2 0
3 1
4 2
5 3
6 4
7 0
8 1
9 2
dtype: int64
This happens because cumcount starts counting from 0 within each group. To get the desired result, I prepended a row containing 0, calculated everything, and then dropped that extra row at the end to get:
x = pd.Series([0], index=[0])
df = pd.concat([x, df])
df.eq(0).cumsum().groupby('X').cumcount().reset_index(drop=True).drop(0).reset_index(drop=True)
0 1
1 2
2 0
3 1
4 2
5 3
6 4
7 0
8 1
9 2
dtype: int64
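A variant that avoids prepending and dropping a row, sketched on a plain Series: only the rows before the first zero are off by one, so they can be shifted directly. The names group, pos and out are illustrative.

```python
import pandas as pd

s = pd.Series([7, 2, 0, 3, 4, 2, 5, 0, 3, 4])

# group id: number of zeros seen so far (0 for the rows before the first zero)
group = s.eq(0).cumsum()
# position inside each group, starting from 0
pos = s.groupby(group).cumcount()
# rows before the first zero have no zero to count from, so shift them by one
out = pos + group.eq(0).astype(int)
print(out.tolist())
```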