Pandas: Use selected amount of previous rows in apply function - python

Let's say I have the dataframe below:
index value
1 1
2 2
3 3
4 4
I want to apply a function to each row that also uses the previous two rows, using an "apply" statement. For example, say I want to multiply the current row by the previous 2 rows, if they exist. (This could be any function.)
Result:
index value result
1 1 nan
2 2 nan
3 3 6
4 4 24
Thank you.

You can try rolling with prod:
df['result'] = df['value'].rolling(3).apply(lambda x: x.prod())
Output:
index value result
0 1 1 NaN
1 2 2 NaN
2 3 3 6.0
3 4 4 24.0
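Since the question says the function could be anything, here is a minimal sketch of the same rolling pattern with a generic reducer. It assumes a window of 3 (the current row plus the two before it) and passes raw=True so the function receives a plain NumPy array. The helper name window_func is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": [1, 2, 3, 4]})

def window_func(values):
    # values is a NumPy array holding the current row and the two rows before it
    return np.prod(values)

# Rows without two predecessors get NaN, because min_periods defaults to the window size
df["result"] = df["value"].rolling(window=3).apply(window_func, raw=True)
print(df)
```

Swapping np.prod for another reduction (a sum, a range, a custom formula) should only require changing the body of window_func.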

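Use the assign function. Note that cumprod is a running product over all rows, so this reproduces the expected result only for this four-row example: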
df = df.assign(result = lambda x: x['value'].cumprod().tail(len(df)-2))

I presume you have more than four rows. If so, try grouping every four rows, taking the cumulative product, choosing the last 2 values of each group, and joining them back to the original dataframe.
df['result']=df.index.map(df.assign(result=df['value'].cumprod(0)).groupby(df.index//4).result.tail(2).to_dict())
If there are just four rows, combining .cumprod() and .tail() should do:
df['result']=df['value'].cumprod(0).tail(2)
index value result
0 1 1 NaN
1 2 2 NaN
2 3 3 6.0
3 4 4 24.0
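To see where the cumprod-based answers and the rolling answer part ways, the following sketch runs both on a five-row frame. The fifth row is an assumption added purely for illustration; it is not part of the question's data.

```python
import pandas as pd

df = pd.DataFrame({"value": [1, 2, 3, 4, 5]})

# Running product over all rows seen so far
df["cumprod"] = df["value"].cumprod()

# Product of the current row and the two rows before it
df["rolling3"] = df["value"].rolling(3).apply(lambda x: x.prod())

print(df)
```

The two columns agree on rows 3 and 4 but differ on row 5, where the running product includes all five values and the 3-row window includes only the last three.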

Related

Collapse a dataframe's column into its distinct values and create a new column based on another's frequency

I would like to take a dataframe such as:
USER PACKAGE
0 1 1
1 1 1
2 1 2
3 1 1
4 1 2
5 1 3
6 2 ...
I would like to select the distinct users and then add new columns based on the frequency of the different packages, i.e. the highest-frequency package, the second highest, etc.
User First Second Third
0 1 1 2 3
1 2 ...
I can implement this with for loops, but that's obviously a bad fit for dataframes. I need to run this on millions of records and can't quite find a vectorized way of doing it.
Cheers
On SO you're supposed to attempt it and post your own code. Here are some hints for implementing the solution:
Do .groupby('USER')... then .value_counts() ...
(don't need to .sort(), since .value_counts() does that by default)
take the .head(3)...
then pivot into a table, in that same pivot command there's an option to add the column names 'First, Second, Third'
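The hints above can be sketched roughly as follows. This is one possible reading of them, not the hint author's own code; the intermediate names counts, top3, wide and the helper columns n and rank are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"USER": [1, 1, 1, 1, 1, 1, 2],
                   "PACKAGE": [1, 1, 2, 1, 2, 3, 3]})

# Count packages per user; value_counts sorts by frequency within each user
counts = df.groupby("USER")["PACKAGE"].value_counts()

# Keep at most the three most frequent packages per user
top3 = counts.groupby(level="USER").head(3).reset_index(name="n")

# Position of each package within its user's ranking: 0, 1, 2
top3["rank"] = top3.groupby("USER").cumcount()

# One row per user, one column per rank
wide = (top3.pivot(index="USER", columns="rank", values="PACKAGE")
            .rename(columns={0: "First", 1: "Second", 2: "Third"})
            .reset_index())
print(wide)
```

Users with fewer than three distinct packages get NaN in the unused columns, which also turns those columns into floats.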
You can use SeriesGroupBy.value_counts, which sorts by frequency by default. Take the first 3 index values, convert them to a Series, reshape with Series.unstack, rename the columns, and finally convert the index to a column:
print (df)
USER PACKAGE
0 1 1
1 1 1
2 1 2
3 1 1
4 1 2
5 1 3
6 2 3
df = (df.groupby('USER')['PACKAGE']
        .apply(lambda x: pd.Series(x.value_counts().index[:3]))
        .unstack()
        .rename(columns=dict(enumerate(['First','Second','Third'])))
        .reset_index())
print (df)
USER First Second Third
0 1 1.0 2.0 3.0
1 2 3.0 NaN NaN
If you need all packages ranked by count, not just the top 3:
df = (df.groupby('USER')['PACKAGE']
        .apply(lambda x: pd.Series(x.value_counts().index))
        .unstack())
print (df)
0 1 2
USER
1 1.0 2.0 3.0
2 3.0 NaN NaN
EDIT: Another idea, which I hope is faster:
s = (df.groupby('USER')['PACKAGE']
       .apply(lambda x: x.value_counts().index[:3]))
df = (pd.DataFrame(s.tolist(), index=s.index, columns=['First','Second','Third'])
        .reset_index())
print (df)
USER First Second Third
0 1 1 2.0 3.0
1 2 3 NaN NaN
I assumed you want to count the number of occurrences of each user and package pair:
import pandas as pd
USER =[1,1,1,1,1,1,2]
PACKAGE=[1,1,2,1,2,3,3]
df=pd.DataFrame({'user':USER,'package':PACKAGE})
results=df.groupby(['user','package']).size()
results=results.sort_values(ascending=False)
results=results.unstack(level='package').fillna(0)
results=results.rename(columns={1:'First',2:'Second',3:'Third'})
print(results)
output:
package First Second Third
user
1 3.0 2.0 1.0
2 0.0 0.0 1.0
For user 1, the highest-frequency package is type 1, the second highest is type 2, and the third highest is type 3. For user 2, the highest-ranked package is type 3. You can do a lookup on the results to produce this output.
Try using Groupby:
df.groupby(['X']).get_group('A')

Get nth row of groups and fill with 'None' if row is missing

I have a df:
a b c
1 2 3 6
2 2 5 7
3 4 6 8
I want every nth row of groupby a:
w=df.groupby('a').nth(0) #first row
x=df.groupby('a').nth(1) #second row
The second group of the df has no second row, in this case I want to have 'None' values.
[In:] df.groupby('a').nth(1)
[Out:]
a b c
1 2 5 7
2 None None None
Or maybe simpler:
The df has 1-4 rows within each group. If a group has fewer than 4 rows, I want to extend the group so that it has 4 rows, filling the missing rows with 'None'. Afterwards, if I pick the nth row of each group, I have the desired output.
If you are only interested in a specific nth row but some groups do not have enough rows, consider using reindex with the unique values from column a, like:
print (df.groupby('a').nth(1).reindex(df['a'].unique()).reset_index())
a b c
0 2 5.0 7.0
1 4 NaN NaN
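In recent pandas versions (2.0 and later, as far as I am aware), nth acts as a filter and keeps the original row labels, so reindexing its result by the group keys may not line up as shown above. The sketch below is an alternative that avoids nth altogether by using cumcount. The names pos and nth_rows are made up for illustration, and n is assumed to be a 0-based position.

```python
import pandas as pd

df = pd.DataFrame({"a": [2, 2, 4], "b": [3, 5, 6], "c": [6, 7, 8]})

n = 1  # 0-based position within each group, so 1 means "second row"

# Position of every row inside its group
pos = df.groupby("a").cumcount()

# Keep the rows at position n, then add back the groups that had none
nth_rows = (df[pos == n]
            .set_index("a")
            .reindex(df["a"].unique())
            .rename_axis("a")
            .reset_index())
print(nth_rows)
```

Groups without an nth row come back with NaN rather than the literal None; a further step would be needed if None objects are really required.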
One way is to assign a count/rank column and reindex/stack:
n=2
(df.assign(rank=df.groupby('a').cumcount())
.query('rank < @n')
.set_index(['a','rank'])
.unstack('rank')
.stack('rank', dropna=False)
.reset_index()
.drop('rank', axis=1)
)
Output:
a b c
0 2 3.0 6.0
1 2 5.0 7.0
2 4 6.0 8.0
3 4 NaN NaN

Compute difference between rows prior to and following a specific row - pandas

I want to find the difference between rows prior to and following to the specific row. Specifically, I have the following dataset:
Number of rows A
1 4
2 2
3 2
4 3
5 2
I should get the following data:
Number of rows A B
1 4 NaN (since there is no row before this row)
2 2 2 (4-2)
3 2 -1 (2-3)
4 3 0 (2-2)
5 2 NaN (since there is no row after this row)
As you can see, each row in column B equals the difference between the previous and the following rows in column A. For example, the second row in column B equals the value in the first row of column A minus the value in the third row of column A. IMPORTANT POINT: I do not need only the immediately previous and following rows. I need the difference between the row 2 before and the row 2 after. That is, the value in row 23 of column B should equal the value in row 21 of column A minus the value in row 25 of column A. I used the previous and following rows above for simplicity.
I hope that explains it.
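It seems like you need a centered rolling window, which you can request with the argument center=True. Passing raw=True hands the lambda a NumPy array, so s[0] and s[-1] index by position:
>>> df.A.rolling(3, center=True).apply(lambda s: s[0]-s[-1], raw=True)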
0 NaN
1 2.0
2 -1.0
3 0.0
4 NaN
Name: A, dtype: float64
This approach works for any window. Notice that this is a centered window, so the size of the window has to be N+N+1 (where N is the number of lookback and lookforward rows, and you add 1 to account for the value in the middle). Thus, the general formula is
window = 2*N + 1
If you need 2 rows before and 2 after, then N = 2. If you need 5 and 5, N = 5 (and window = 11), etc. The apply lambda stays the same.
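As a minimal sketch of the window formula, here is the N = 2 case on the question's five values. The frame is rebuilt from the question's data, and raw=True is assumed so that the lambda receives a NumPy array and can index it by position.

```python
import pandas as pd

df = pd.DataFrame({"A": [4, 2, 2, 3, 2]})

N = 2               # rows to look back and to look forward
window = 2 * N + 1  # centered window: N before + current row + N after

# raw=True passes a NumPy array, so w[0] is the oldest row and w[-1] the newest
df["B"] = df["A"].rolling(window, center=True).apply(lambda w: w[0] - w[-1], raw=True)
print(df)
```

With only five rows, just the middle row has two neighbours on each side, so it is the only one that gets a value.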
Let the series (i.e. DataFrame column) be s.
You want:
s.shift(1) - s.shift(-1)
You need to use .shift on the column (series) on which you want to run your calculation.
With shift(1) you get the previous row; with shift(-1) you get the next row.
From there, you need to calculate previous - next:
>>> s = pd.Series([4,2,2,3,2])
>>> s
0 4
1 2
2 2
3 3
4 2
dtype: int64
# previous
>>> s.shift(1)
0 NaN
1 4.0
2 2.0
3 2.0
4 3.0
dtype: float64
# next
>>> s.shift(-1)
0 2.0
1 2.0
2 3.0
3 2.0
4 NaN
dtype: float64
# previous - next
>>> s.shift(1)-s.shift(-1)
0 NaN
1 2.0
2 -1.0
3 0.0
4 NaN
dtype: float64
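The question actually asks for a gap of 2 rows on each side, so the shift idea can be generalised with a parameter. This is a small sketch; the helper name lag_lead_diff is made up for illustration.

```python
import pandas as pd

s = pd.Series([4, 2, 2, 3, 2])

def lag_lead_diff(series, n):
    # Value n rows earlier minus value n rows later; NaN where either is missing
    return series.shift(n) - series.shift(-n)

print(lag_lead_diff(s, 1))  # previous row minus next row
print(lag_lead_diff(s, 2))  # 2 rows back minus 2 rows ahead
```

Because shift works on whole columns, this avoids calling a Python function once per window.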

In a Pandas dataframe, how can I extract the difference between the values on separate rows within the same column, conditional on a second column? [duplicate]

This question already has answers here:
Pandas groupby multiple fields then diff
(2 answers)
Closed 4 years ago.
This is part of a larger project, but I've broken my problem down into steps, so here's the first step. Take a Pandas dataframe, like this:
index | user time
---------------------
0 F 0
1 T 0
2 T 0
3 T 1
4 B 1
5 K 2
6 J 2
7 T 3
8 J 4
9 B 4
For each unique user, can I extract the difference between the values in column "time," but with some conditions?
So, for example, there are two instances of user J, and the "time" difference between these two instances is 2. Can I extract the difference, 2, between these two rows? Then if that user appears again, extract the difference between that row and the previous appearance of that user in the dataframe?
I believe you need DataFrameGroupBy.diff:
df['new'] = df.groupby('user')['time'].diff()
print (df)
user time new
0 F 0 NaN
1 T 0 NaN
2 T 0 0.0
3 T 1 1.0
4 B 1 NaN
5 K 2 NaN
6 J 2 NaN
7 T 3 2.0
8 J 4 2.0
9 B 4 3.0
I think np.where combined with pandas shift can do this.
This subtracts consecutive time values, but only where the user is the same as in the previous row:
df1 = np.where(df['user'] == df['user'].shift(1), df['time'] - df['time'].shift(1), np.nan)
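The np.where idea only ever compares physically adjacent rows, whereas groupby(...).diff() compares each row with the same user's previous appearance, wherever it is. The sketch below puts both side by side on the question's data; the column names adjacent and per_user are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"user": list("FTTTBKJTJB"),
                   "time": [0, 0, 0, 1, 1, 2, 2, 3, 4, 4]})

# Difference from the previous row, kept only when that row has the same user
df["adjacent"] = np.where(df["user"] == df["user"].shift(1),
                          df["time"] - df["time"].shift(1),
                          np.nan)

# Difference from the same user's previous appearance, adjacent or not
df["per_user"] = df.groupby("user")["time"].diff()
print(df)
```

Rows 7 to 9 show the difference: each of those users appeared earlier, but not in the row directly above, so only the grouped version produces a value.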

issue with np.where() for creating new column in Pandas (possibly NaN issue?)

I have a dataframe with 2 columns, and I want to create a 3rd column based on a comparison between the 2 columns.
So the logic is:
column 1 val = 3, column 2 val = 4, so the new column value is nothing
column 1 val = 3, column 2 val = 2, so the new column is 3
It's a very similar problem to one previously asked but the answer there isn't working for me, using np.where()
Here's what I tried:
FinalDF['c'] = np.where(FinalDF['a']>FinalDF['b'],[FinalDF['a'],""])
and after that failed I tried to see if maybe it doesn't like the [x,y] I gave it, so I tried:
FinalDF['c'] = np.where(FinalDF['a']>FinalDF['b'],[1,0])
the result is always:
ValueError: either both or neither of x and y should be given
Edit: I also removed the [x,y], to see what happens, since the documentation says it is optional. But I still get an error:
ValueError: Length of values does not match length of index
Which is odd because they are sitting in the same dataframe, although one column does have some NaN values.
I don't think I can use np.select because I have a condition here. I've linked to the previous questions so readers can reference them in future questions.
Thanks for any help.
I think that this should work:
FinalDF['c'] = np.where(FinalDF['a']>FinalDF['b'], FinalDF['a'],"")
Example:
FinalDF = pd.DataFrame({'a':[4,2,4,5,5,4],
                        'b':[4,3,2,2,2,4],
                       })
print(FinalDF)
a b
0 4 4
1 2 3
2 4 2
3 5 2
4 5 2
5 4 4
Output:
a b c
0 4 4
1 2 3
2 4 2 4
3 5 2 5
4 5 2 5
5 4 4
Or, if column b has to be greater than column a, use this:
FinalDF['c'] = np.where(FinalDF['a']<FinalDF['b'], FinalDF['b'],"")
Output:
a b c
0 4 4
1 2 3 3
2 4 2
3 5 2
4 5 2
5 4 4
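For context on the errors in the question: np.where takes the two choices as separate arguments, so passing them as one list leaves y missing, and calling it with only a condition returns a tuple of index arrays rather than a column of values. The sketch below is an alternative, not the answer's method: it uses Series.where to leave NaN instead of an empty string, so the new column stays numeric. The column name c_num is made up for illustration.

```python
import pandas as pd

FinalDF = pd.DataFrame({"a": [4, 2, 4, 5, 5, 4],
                        "b": [4, 3, 2, 2, 2, 4]})

# Keep a where a > b; every other row becomes NaN
FinalDF["c_num"] = FinalDF["a"].where(FinalDF["a"] > FinalDF["b"])
print(FinalDF)
```

Whether NaN or an empty string is the better marker for "nothing" depends on what is done with the column afterwards; NaN is usually easier to work with in later numeric steps.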
