Here is my simplified example dataframe:
timestamp A B
1422404668 1 1
1422404670 2 2
1422404672 -3 3
1422404674 -4 4
1422404676 5 5
1422404678 -6 6
1422404680 -7 7
1422404680 8 8
Is there a way to group/filter consecutive runs of positive and negative values in column A and get the first value of each group in column A and the sum of column B, as in the output below?
Expected output:
timestamp A B
1422404668 1 3
1422404672 -3 7
1422404676 5 5
1422404678 -6 13
1422404680 8 8
Data:
{'timestamp': [1422404668, 1422404670, 1422404672, 1422404674,
1422404676, 1422404678, 1422404680, 1422404680],
'A': [1, 2, -3, -4, 5, -6, -7, 8], 'B': [1, 2, 3, 4, 5, 6, 7, 8]}
IIUC, you could drop consecutive rows whose "A" has the same sign (e.g., the row with 2 in column "A" is dropped because it has the same sign as 1, the immediately preceding value in column "A"):
out = df[df['A'].ge(0).astype(int).diff()!=0]
It turns out you don't need to convert to int, since pandas computes diff on a boolean Series with XOR rather than subtraction (thanks @Corralien):
out = df[df['A'].ge(0).diff()!=0]
Output:
timestamp A
0 1422404668 1
2 1422404672 -3
4 1422404676 5
5 1422404678 -6
7 1422404680 8
Edit:
Given OP's edit, we can use cumsum on the mask to create group numbers, group by them, and use agg to apply different methods to different columns:
out = (df.groupby(df['A'].ge(0).diff().ne(0).cumsum())
         .agg({'timestamp': 'first', 'A': 'first', 'B': 'sum'})
         .reset_index(drop=True))
Output:
timestamp A B
0 1422404668 1 3
1 1422404672 -3 7
2 1422404676 5 5
3 1422404678 -6 13
4 1422404680 8 8
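To see how the grouping works, here is a small sketch (assuming the sample data above) that prints the intermediate group key; each run of same-signed values in A gets its own group number:
import pandas as pd

df = pd.DataFrame({'timestamp': [1422404668, 1422404670, 1422404672, 1422404674,
                                 1422404676, 1422404678, 1422404680, 1422404680],
                   'A': [1, 2, -3, -4, 5, -6, -7, 8],
                   'B': [1, 2, 3, 4, 5, 6, 7, 8]})

key = df['A'].ge(0).diff().ne(0).cumsum()
print(key.tolist())  # [1, 1, 2, 2, 3, 4, 4, 5]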
Something like this? I made two frames, one with the negative values from column A and one with the positive values. Then I take the first occurrence for each and concat the frames into out.
df_positive = df[df['A'] > 0]
df_negative = df[df['A'] < 0]
# take the first row for each distinct value of A
df_positive = df_positive.groupby('A').first().reset_index()
df_negative = df_negative.groupby('A').first().reset_index()
out = pd.concat([df_positive, df_negative])[['timestamp', 'A']]
I have a pandas dataframe like this:
col
0 3
1 5
2 9
3 5
4 6
5 6
6 11
7 6
8 2
9 10
that could be created in Python with the code:
import pandas as pd
df = pd.DataFrame(
    {
        'col': [3, 5, 9, 5, 6, 6, 11, 6, 2, 10]
    }
)
I want to find the rows that have a value greater than 8, and also there is at least one row before them that has a value less than 4.
So the output should be:
col
2 9
9 10
You can see that index 0 has a value equal to 3 (less than 4) and then index 2 has a value greater than 8. So we add index 2 to the output and continue checking the next rows, but we no longer consider indexes 0, 1, 2; the search resets.
Index 6 has a value equal to 11, but none of the indexes 3, 4, 5 has a value less than 4, so we don't add index 6 to the output.
Index 8 has a value equal to 2 (less than 4) and index 9 has a value equal to 10 (greater than 8), so index 9 is added to the output.
My priority is to avoid any for-loops in the code.
Do you have any ideas?
Boolean indexing to the rescue:
# value > 8
m1 = df['col'].gt(8)
# get previous value <4
# check if any occurred previously
m2 = df['col'].shift().lt(4).groupby(m1[::-1].cumsum()).cummax()
df[m1&m2]
Output:
col
2 9
9 10
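A sketch of the intermediates may help (assuming the sample data above): the reversed cumsum labels each stretch of rows ending at a value > 8, and cummax propagates "a previous value < 4 was seen" forward within each stretch:
import pandas as pd

df = pd.DataFrame({'col': [3, 5, 9, 5, 6, 6, 11, 6, 2, 10]})
m1 = df['col'].gt(8)
print(m1[::-1].cumsum()[::-1].tolist())  # group labels: [3, 3, 3, 2, 2, 2, 2, 1, 1, 1]
m2 = df['col'].shift().lt(4).groupby(m1[::-1].cumsum()).cummax()
print(m2.tolist())  # [False, True, True, False, False, False, False, False, False, True]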
Check the below code using np.where and cumsum:
import numpy as np

df['val'] = np.where(df['col'] > 8, True, False).cumsum()
df['val'] = np.where(df['col'] > 8, df['val'] - 1, df['val'])
df.assign(min_value=df.groupby('val')['col'].transform('min')) \
  .query('col > 8 and min_value < 4')[['col']]
OUTPUT:
col
2 9
9 10
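For reference, here is what the intermediate 'val' column looks like on the sample data; each value > 8 closes a group, so the group's minimum can then be checked against 4:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': [3, 5, 9, 5, 6, 6, 11, 6, 2, 10]})
val = np.where(df['col'] > 8, True, False).cumsum()
val = np.where(df['col'] > 8, val - 1, val)
print(val.tolist())  # [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]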
Although my previous question was answered here (Python dataframe new column with value based on value in other row), I still want to know how to use a column's value in iloc (or shift, rolling, etc.).
I have a dataframe with two columns A and B; how do I use the value of column B in iloc? Or in shift()?
d = {'A': [8, 2, 4, 5, 6, 4, 3, 5, 5, 3], 'B': [2, -1, 4, 5, 0, -3, 8, 2, 6, -1]}
df = pd.DataFrame(data=d)
Using iloc I get this error.
df['C'] = df['A'] * df['A'].iloc[df['B']]
ValueError: cannot reindex from a duplicate axis
Using shift() another one.
df['C'] = df['A'] * df['A'].shift(df['B'])
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Is it possible what I want to do? If yes, how? If no, why not?
Use numpy indexing:
print (df['A'].to_numpy()[df['B'].to_numpy()])
[4 3 6 4 8 5 5 4 3 3]
df['C'] = df['A'] * df['A'].to_numpy()[df['B'].to_numpy()]
print (df)
A B C
0 8 2 32
1 2 -1 6
2 4 4 24
3 5 5 20
4 6 0 48
5 4 -3 20
6 3 8 15
7 5 2 20
8 5 6 15
9 3 -1 9
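Note that this is positional indexing with numpy semantics, so negative values in B count from the end of A; a quick check on the sample data:
import numpy as np

a = np.array([8, 2, 4, 5, 6, 4, 3, 5, 5, 3])     # column A
b = np.array([2, -1, 4, 5, 0, -3, 8, 2, 6, -1])  # column B
print(a[b])  # [4 3 6 4 8 5 5 4 3 3] -- b == -1 picks the last element of a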
Numpy indexing is the fastest way, I agree, but you can use a list comprehension + iloc too:
d = {'A': [8, 2, 4, 5, 6, 4, 3, 5, 5, 3], 'B': [2, -1, 4, 5, 0, -3, 8, 2, 6, -1]}
df = pd.DataFrame(data=d)
df['C'] = df['A'] * [df['A'].iloc[i] for i in df['B']]
A B C
0 8 2 32
1 2 -1 6
2 4 4 24
3 5 5 20
4 6 0 48
5 4 -3 20
6 3 8 15
7 5 2 20
8 5 6 15
9 3 -1 9
If I slice a dataframe with something like
>>> df = pd.DataFrame(data=[[x] for x in [1,2,3,5,1,3,2,1,1,4,5,6]], columns=['A'])
>>> df.loc[df['A'] == 1]
# or
>>> df[df['A'] == 1]
A
0 1
4 1
7 1
8 1
how could I pad my selections by a buffer of 1 and get each of the indices 0, 1, 3, 4, 5, 6, 7, 8, 9? I want to select all rows for which the value in column 'A' is 1, but also any row immediately before or after such a row.
Edit: I'm hoping to figure out a solution that works for arbitrary pad sizes, rather than just for a pad size of 1.
Edit 2: here's another example illustrating what I'm going for:
df = pd.DataFrame(data=[[x] for x in [1,2,3,5,3,2,1,1,4,5,6,0,0,3,1,2,4,5]], columns=['A'])
and we're looking for pad == 2. In this case I'd be trying to fetch rows 0, 1, 2, 4, 5, 6, 7, 8, 9, 12, 13, 14, 15, 16.
You can use shift with bitwise OR (|):
c = df['A'] == 1
df[c|c.shift()|c.shift(-1)]
A
0 1
1 2
3 5
4 1
5 3
6 2
7 1
8 1
9 4
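The same idea generalizes to an arbitrary pad size n by OR-ing shifted copies of the mask (a sketch):
from functools import reduce
import operator

n = 2  # pad size
c = df['A'] == 1
m = reduce(operator.or_, (c.shift(i, fill_value=False) for i in range(-n, n + 1)))
df[m]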
For arbitrary pad sizes, you may also try where, interpolate, and notna to create the mask (taking the column as a Series so that df[m] filters rows):
n = 2
c = df['A'].where(df['A'] == 1)
m = c.interpolate(limit=n, limit_direction='both').notna()
df[m]
Output:
A
0 1
1 2
2 3
4 3
5 2
6 1
7 1
8 4
9 5
12 0
13 3
14 1
15 2
16 4
Here is an approach that allows for multiple pad levels. Use ffill and bfill on the boolean mask (df['A'] == 1), after converting the False values to np.nan:
import numpy as np
pad = 2
mask = ((df['A'] == 1).replace(False, np.nan)
        .ffill(limit=pad).bfill(limit=pad)
        .replace(np.nan, False).astype(bool))
df[mask]
Here it is in action:
def padsearch(df, column, value, pad):
    mask = ((df[column] == value).replace(False, np.nan)
            .ffill(limit=pad).bfill(limit=pad)
            .replace(np.nan, False).astype(bool))
    return df[mask]
# your first example
df = pd.DataFrame(data=[[x] for x in [1,2,3,5,1,3,2,1,1,4,5,6]], columns=['A'])
print(padsearch(df=df, column='A', value=1, pad=1))
# your other example
df = pd.DataFrame(data=[[x] for x in [1,2,3,5,3,2,1,1,4,5,6,0,0,3,1,2,4,5]], columns=['A'])
print(padsearch(df=df, column='A', value=1, pad=2))
Result:
A
0 1
1 2
3 5
4 1
5 3
6 2
7 1
8 1
9 4
A
0 1
1 2
2 3
4 3
5 2
6 1
7 1
8 4
9 5
12 0
13 3
14 1
15 2
16 4
Granted, the command is far less nice, and it's a little clunky to convert False to and from null. But it still uses all pandas builtins, so it's fairly quick.
I found another solution, but it's not nearly as slick as some of the ones already posted.
# setup
df = ...
pad = 2
# determine the set of indices, clipped to the frame's bounds
indices = {
    x + y
    for x in df[df['A'] == 1].index
    for y in range(-pad, pad + 1)
    if 0 <= x + y < len(df)
}
# fetch the rows in index order
df.iloc[sorted(indices)]
My data is like this:
df = pd.DataFrame({'a': [5,0,0, 6, 0, 0, 0 , 12]})
I want to count the zeros above the 6 and replace them with 6/(count+1) = 6/3 = 2 (I will also replace the original 6).
I also want to do a similar thing with the zeros above the 12.
So, 12/(count+1) = 12/4 = 3.
So the final result will be:
[5,2,2, 2, 3, 3, 3 , 3]
I am not sure how to start. Are there any functions that do this?
Thanks.
Use GroupBy.transform with 'mean' and custom groups: test for values not equal to 0, reverse the order, take the cumulative sum, and reverse back to the original order:
g = df['a'].ne(0).iloc[::-1].cumsum().iloc[::-1]
df['b'] = df.groupby(g)['a'].transform('mean')
print (df)
a b
0 5 5
1 0 2
2 0 2
3 6 2
4 0 3
5 0 3
6 0 3
7 12 3
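To see why this works, here is a sketch printing the group labels on the sample data; each nonzero value closes a group that includes the zeros above it, and the group mean spreads the value evenly:
import pandas as pd

df = pd.DataFrame({'a': [5, 0, 0, 6, 0, 0, 0, 12]})
g = df['a'].ne(0).iloc[::-1].cumsum().iloc[::-1]
print(g.tolist())  # [3, 2, 2, 2, 1, 1, 1, 1]
# group 2 is [0, 0, 6] with mean 2; group 1 is [0, 0, 0, 12] with mean 3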
I have a dataframe with many columns.
I want to apply a function to each row that alters several columns, where the set of columns depends on another column.
def mark(row):
    columns = get_columns_to_alter(row['Text'])
    for c in columns:
        row[c] = True
And I was trying to use apply function
df.apply(mark, axis=1)
But it does not alter these columns. What am I doing wrong? The function I gave is pseudocode, but the idea is that it gets the names of the columns to change based on the "Text" column.
OK, this is a bit confusing, to be honest. I see several issues.
First, applying a function to each row with DataFrame.apply should look more like:
df.apply(lambda x: mark(x), axis=1)
so that you actually loop through each row.
Second, DataFrame.apply creates a copy of the Series for each row (in your case); thus, the changes are applied not to df but to the new row value. If you want to change df, you need to (a) make sure that mark returns something and (b) assign the result to something else:
def mark(row):
    columns = get_columns_to_alter(row['Text'])
    if len(columns) > 0:
        row[columns] = True
    return row
new_df = df.apply(lambda x: mark(x), axis=1)
Something like this should do what you expect.
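Here is a minimal runnable sketch, with a hypothetical stand-in for get_columns_to_alter (it simply treats the words in 'Text' as the column names to set):
import pandas as pd

df = pd.DataFrame({'Text': ['a b', 'b', 'a'], 'a': False, 'b': False})

def get_columns_to_alter(text):
    # hypothetical stand-in: the words in 'Text' name the columns to set
    return text.split()

def mark(row):
    columns = get_columns_to_alter(row['Text'])
    if len(columns) > 0:
        row[columns] = True
    return row  # apply collects the returned rows into a new frame

new_df = df.apply(mark, axis=1)
print(new_df)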
Here is one solution via numpy and itertools.chain. As far as possible, it's a good idea to remove loops.
from itertools import chain
import numpy as np
df = pd.DataFrame(np.random.randint(0, 9, (10, 10)))
df['cols'] = [np.random.randint(0, 9, 3) for _ in range(len(df))]
def calc_cols(s):
    arr = s.values.tolist()
    # apply a function on arr here, e.g.
    # arr = list(map(f, arr))
    idx = np.repeat(list(range(len(arr))), list(map(len, arr)))
    return idx, list(chain(*arr))
A, cols = df.values, calc_cols(df['cols'])
A[cols[0], cols[1]] = -1
df_res = pd.DataFrame(A, columns=df.columns)
# 0 1 2 3 4 5 6 7 8 9 cols
# 0 2 4 -1 4 -1 2 6 6 8 1 [4, 4, 2]
# 1 4 -1 -1 3 4 4 -1 5 6 7 [2, 1, 6]
# 2 -1 1 7 1 2 -1 2 2 -1 0 [8, 0, 5]
# 3 2 4 -1 6 -1 8 6 -1 0 3 [7, 2, 4]
# 4 -1 5 5 2 8 2 -1 8 -1 6 [8, 6, 0]
# 5 5 6 0 3 5 -1 -1 5 3 7 [6, 5, 6]
# 6 -1 0 7 1 4 -1 -1 6 1 8 [5, 6, 0]
# 7 2 6 4 6 -1 6 -1 5 7 6 [6, 4, 6]
# 8 -1 8 1 -1 0 7 8 -1 2 3 [3, 0, 7]
# 9 2 4 6 6 -1 -1 0 2 -1 0 [4, 8, 5]
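One caveat: df.values on a frame that includes the object-typed 'cols' column yields an object array, so df_res comes back with object dtype; if that matters, you can restore inferable dtypes afterwards:
# restore numeric dtypes after the object-array round trip
df_res = df_res.infer_objects()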