Frequency of repetitive position in pandas data frame - python

Hi, I am trying to find the repetitive positions in the following data frame:
import pandas as pd

data = pd.DataFrame()
data['league'] = ['A','A','A','A','A','A','B','B','B']
data['Team'] = ['X','X','X','Y','Y','Y','Z','Z','Z']
data['week'] = [1,2,3,1,2,3,1,2,3]
data['position'] = [1,1,2,2,2,1,2,3,4]
I want to compare each position with the one in the previous row: if it is the same, I will assign 0; if it is different from the previous row, I will assign 1. My expected outcome will group by (league, Team and week) and work out this frequency.
Can anyone advise how to do that in pandas?
Thanks,
Zep

Use diff, and compare against 0:
v = df.position.diff()
v[0] = 0
df['frequency'] = v.ne(0).astype(int)
print(df)
league Team week position frequency
0 A X 1 1 0
1 A X 2 1 0
2 A X 3 2 1
3 A Y 1 2 0
4 A Y 2 2 0
5 A Y 3 1 1
6 B Z 1 2 1
7 B Z 2 3 1
8 B Z 3 4 1
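One caveat, noted as an aside: v[0] = 0 assigns by label, so it relies on the frame having the default RangeIndex. With an arbitrary index, positional assignment via iloc is the safer sketch:
v = df.position.diff()
v.iloc[0] = 0  # positional: always the first row, whatever the index labels are
df['frequency'] = v.ne(0).astype(int)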
For performance reasons, you should try to avoid a fillna call.
df = pd.concat([df] * 100000, ignore_index=True)

%timeit df['frequency'] = df['position'].diff().abs().fillna(0, downcast='infer')
83.7 ms ± 1.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
v = df.position.diff()
v[0] = 0
df['frequency'] = v.ne(0).astype(int)
10.9 ms ± 217 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
To extend this answer to work in a groupby, use the same idea; zeroing the NaNs matters because each group's first diff is NaN, and NaN != 0 would otherwise flag it as a change:
import numpy as np

v = df.groupby(['league', 'Team', 'week']).position.diff()
v[np.isnan(v)] = 0
df['frequency'] = v.ne(0).astype(int)

Use diff and abs with fillna (abs is enough here because consecutive positions differ by at most 1; for bigger jumps, compare against 0 as above):
data['frequency'] = data['position'].diff().abs().fillna(0, downcast='infer')
print(data)
league Team week position frequency
0 A X 1 1 0
1 A X 2 1 0
2 A X 3 2 1
3 A Y 1 2 0
4 A Y 2 2 0
5 A Y 3 1 1
6 B Z 1 2 1
7 B Z 2 3 1
8 B Z 3 4 1
Using groupby on all three keys gives all zeros, because each (league, Team, week) combination is a single row, so every within-group diff is NaN:
data.groupby(['league', 'Team', 'week'])['position'].diff().fillna(0, downcast='infer')
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
Name: position, dtype: int64
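If the intent was instead to compare positions within each team across weeks (an assumption about what the questioner wants, not something the answers state), grouping by league and Team only gives a per-team diff. Note that each team's first week then counts as "no change", so row 6 becomes 0 rather than the 1 shown above:
v = data.groupby(['league', 'Team'])['position'].diff()
v[v.isna()] = 0  # each team's first week has no previous week to compare against
data['frequency'] = v.ne(0).astype(int)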

Related

Create pandas subtraction column based on one other column in two conditions

I have a dataframe with subjects in two different conditions and many value columns.
d = {
"subject": [1, 1, 2, 2],
"condition": ["on", "off", "on", "off"],
"value": [1, 2, 3, 5]
}
df = pd.DataFrame(data=d)
df
   subject condition  value
0        1        on      1
1        1       off      2
2        2        on      3
3        2       off      5
I would like to get new columns which indicate the difference off-on between both conditions. In this case I would like to get:
   subject condition  value  off-on
0        1        on      1       1
1        1       off      2       1
2        2        on      3       2
3        2       off      5       2
How would I best do that?
I could achieve the result using this code:
onoff = (df[df.condition == "off"].value.reset_index() - df[df.condition == "on"].value.reset_index()).value
for idx, sub in enumerate(df.subject.unique()):
    df.loc[df.subject == sub, "off-on"] = onoff.iloc[idx]
But it seems quite tedious and slow. I was hoping for a solution without a loop, since I have many rows and very many value columns. Is there a better way?
Use a pivot combined with map:
df['off-on'] = df['subject'].map(
    df.pivot(index='subject', columns='condition', values='value')
      .eval('off-on')
)
Or with a MultiIndex (more efficient than a pivot):
s = df.set_index(['condition', 'subject'])['value']
df['off-on'] = df['subject'].map(s['off']-s['on'])
Output:
subject condition value off-on
0 1 on 1 1
1 1 off 2 1
2 2 on 3 2
3 2 off 5 2
timings
On 100k subjects
# MultiIndexing
43.2 ms ± 2.46 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# pivot
77 ms ± 12.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
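The question mentions "very many value columns". Assuming exactly one on and one off row per subject, the pivot approach extends to several columns at once; a sketch, where v1 and v2 are hypothetical column names:
value_cols = ['v1', 'v2']  # hypothetical: list all your value columns here
wide = df.pivot(index='subject', columns='condition', values=value_cols)
# select the 'off' and 'on' slices of the column MultiIndex and subtract
diff = wide.xs('off', axis=1, level='condition') - wide.xs('on', axis=1, level='condition')
for col in value_cols:
    df[f'{col}_off-on'] = df['subject'].map(diff[col])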
Use DataFrame.pivot, then subtract the off and on columns and map the result back with Series.map:
df1 = df.pivot(index='subject', columns='condition', values='value')
df['off-on'] = df['subject'].map(df1['off'].sub(df1['on']))
print (df)
subject condition value off-on
0 1 on 1 1
1 1 off 2 1
2 2 on 3 2
3 2 off 5 2
Details:
print (df.pivot(index='subject', columns='condition', values='value'))
condition off on
subject
1 2 1
2 5 3
print (df1['off'].sub(df1['on']))
subject
1 1
2 2
dtype: int64

merge groupby results directly back to dataframe

Suppose I have the following data:
df = pd.DataFrame(data = [[1,1,10],[1,2,20],[1,3,50],[2,1,15],[2,2,20],[2,3,30],[3,1,40],[3,2,70]],columns=['id1','id2','x'])
id1 id2 x
0 1 1 10
1 1 2 20
2 1 3 50
3 2 1 15
4 2 2 20
5 2 3 30
6 3 1 40
7 3 2 70
The dataframe is sorted along the two ids. Suppose I'd like to know the value of x of the FIRST observation within each group of id1 observations. The result would be like
id1 id2 x first_x
1 1 10 10
1 2 20 10
1 3 50 10
2 1 15 15
2 2 20 15
2 3 30 15
3 1 40 40
3 2 70 40
How do I achieve this 'subscripting'? Ideally, the new column would be filled for each observation.
I thought along the lines of
df['first_x'] = df.groupby(['id1'])[0]
I think simplest is transform with first:
df['first_x'] = df.groupby('id1')['x'].transform('first')
Or map with a Series created by drop_duplicates:
df['first_x'] = df['id1'].map(df.drop_duplicates('id1').set_index('id1')['x'])
print (df)
id1 id2 x first_x
0 1 1 10 10
1 1 2 20 10
2 1 3 50 10
3 2 1 15 15
4 2 2 20 15
5 2 3 30 15
6 3 1 40 40
7 3 2 70 40
The first is the shortest and fastest solution:
import numpy as np
import pandas as pd

np.random.seed(123)
N = 1000000
df = pd.DataFrame({'id1': np.random.randint(10000, size=N),
                   'x': np.random.randint(10000, size=N)})
df = df.sort_values('id1').reset_index(drop=True)
print (df)
In [179]: %timeit df.join(df.groupby(['id1'])['x'].first(), on='id1', how='left', lsuffix='', rsuffix='_first')
10 loops, best of 3: 125 ms per loop
In [180]: %%timeit
...: first_xs = df.groupby(['id1']).first().to_dict()['x']
...:
...: df['first_x'] = df['id1'].map(lambda id: first_xs[id])
...:
1 loop, best of 3: 524 ms per loop
In [181]: %timeit df['first_x'] = df.groupby('id1')['x'].transform('first')
10 loops, best of 3: 54.9 ms per loop
In [182]: %timeit df['first_x'] = df['id1'].map(df.drop_duplicates('id1').set_index('id1')['x'])
10 loops, best of 3: 142 ms per loop
Something like this?
df = pd.DataFrame(data = [[1,1,10],[1,2,20],[1,3,50],[2,1,15],[2,2,20],[2,3,30],[3,1,40],[3,2,70]],columns=['id1','id2','x'])
df = df.join(df.groupby(['id1'])['x'].first(), on='id1', how='left', lsuffix='', rsuffix='_first')
As you need to consider the entire dataframe when building values for each row, you need an intermediate step.
The following gets your first_x value using a group by, then uses that as a map to add a new column.
import pandas as pd
df = pd.DataFrame(data = [[1,1,10],[1,2,20],[1,3,50],[2,1,15],[2,2,20],[2,3,30],[3,1,40],[3,2,70]],columns=['id1','id2','x'])
first_xs = df.groupby(['id1']).first().to_dict()['x']
df['first_x'] = df['id1'].map(lambda id: first_xs[id])
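Since the question states the frame is sorted by the ids, a duplicated/ffill sketch would also work; this is an alternative under that sorting assumption, not one of the answers above:
first_row = ~df['id1'].duplicated()  # True at each id1 group's first row
df['first_x'] = df['x'].where(first_row).ffill().astype(df['x'].dtype)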

Is there any column match or row match function in python?

I have two data frames, let's say:
dataframe A with a column 'name':
name
0 4
1 2
2 1
3 3
Another dataframe B with two columns, i.e. name and value:
name value
0 3 5
1 2 6
2 4 7
3 1 8
I want to rearrange the values in dataframe B according to the name column in dataframe A.
I am expecting a final dataframe similar to this:
name value
0 4 7
1 2 6
2 1 8
3 3 5
Here are two options:
dfB.set_index('name').loc[dfA.name].reset_index()
Out:
name value
0 4 7
1 2 6
2 1 8
3 3 5
Or,
dfA['value'] = dfA['name'].map(dfB.set_index('name')['value'])
dfA
Out:
name value
0 4 7
1 2 6
2 1 8
3 3 5
Timings:
import numpy as np
import pandas as pd
prng = np.random.RandomState(0)
names = np.arange(10**7)
prng.shuffle(names)
dfA = pd.DataFrame({'name': names})
prng.shuffle(names)
dfB = pd.DataFrame({'name': names, 'value': prng.randint(0, 100, 10**7)})
%timeit dfB.set_index('name').loc[dfA.name].reset_index()
1 loop, best of 3: 2.27 s per loop
%timeit dfA['value'] = dfA['name'].map(dfB.set_index('name')['value'])
1 loop, best of 3: 1.65 s per loop
%timeit dfB.set_index('name').ix[dfA.name].reset_index()
1 loop, best of 3: 1.66 s per loop
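A plain left merge would be a third option, since how='left' preserves the left frame's row order; a minimal sketch (not timed above):
dfC = dfA.merge(dfB, on='name', how='left')  # rows come back in dfA's order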

Removing rows in Pandas based on multiple columns

In Pandas, I have a dataframe with ZipCode, Age, and a bunch of columns that should all have values 1 or 0, ie:
ZipCode Age A B C D
12345 21 0 1 1 1
12345 22 1 0 1 4
23456 45 1 0 1 1
23456 21 3 1 0 0
I want to delete all rows in which a value other than 0 or 1 appears in columns A, B, C, or D, as a way to clean up the data. In this case I would remove the 2nd and 4th rows, because 4 appears in column D of row 2 and 3 appears in column A of row 4. I want this to work even if I have 100 columns to check, so that I don't have to list every column in my conditional statement. How would I do this?
Use isin to test for membership and all to check that every row value passes, then use this boolean mask to filter the df:
In [12]:
df[df.ix[:,'A':].isin([0,1]).all(axis=1)]
Out[12]:
ZipCode Age A B C D
0 12345 21 0 1 1 1
2 23456 45 1 0 1 1
You can opt for a vectorized solution:
In [64]: df[df[['A','B','C','D']].isin([0,1]).sum(axis=1)==4]
Out[64]:
ZipCode Age A B C D
0 12345 21 0 1 1 1
2 23456 45 1 0 1 1
The other two solutions work well, but if you are interested in speed you should look at NumPy's in1d function:
data = df.loc[:, 'A':]
In [72]: df[np.in1d(data.values,[0,1]).reshape(data.shape).all(axis=1)]
Out[72]:
ZipCode Age A B C D
0 12345 21 0 1 1 1
2 23456 45 1 0 1 1
Timing:
In [73]: %timeit data=df.loc[:, 'A':]; df[np.in1d(data.values,[0,1]).reshape(data.shape).all(axis=1)]
1000 loops, best of 3: 558 µs per loop
In [74]: %timeit df[df.ix[:,'A':].isin([0,1]).all(axis=1)]
1000 loops, best of 3: 843 µs per loop
In [75]: %timeit df[df[['A','B','C','D']].isin([0,1]).sum(axis=1)==4]
1000 loops, best of 3: 1.44 ms per loop
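For what it's worth, in current NumPy and pandas the same idea reads a little simpler: np.isin (the successor to in1d) preserves the input's shape, so the reshape is unnecessary, and df.loc replaces the long-removed .ix. A sketch:
data = df.loc[:, 'A':]
df[np.isin(data.to_numpy(), [0, 1]).all(axis=1)]  # isin keeps the 2D shape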

Multiply two Pandas dataframes with same shape and same columns names

I have two dataframes A and B with the same NxM shape. I want to multiply them such that each element of A is multiplied by the corresponding element of B.
e.g:
A,B = input dataframes
C = final dataframe
I want C[i][j] = A[i][j]*B[i][j] for i=1..N and j=1..M
I searched but couldn't find exactly this solution.
I think you can use:
C = A * B
Next solution is with mul:
C = A.mul(B)
Sample:
print A
a b
0 1 3
1 2 4
2 3 7
print B
a b
0 2 3
1 1 4
2 3 2
print A * B
a b
0 2 9
1 2 16
2 9 14
print A.mul(B)
a b
0 2 9
1 2 16
2 9 14
Timings with length of A and B 300k:
A = pd.concat([A]*100000).reset_index(drop=True)
B = pd.concat([B]*100000).reset_index(drop=True)

In [218]: %timeit A * B
The slowest run took 4.27 times longer than the fastest. This could mean that an intermediate result is being cached
100 loops, best of 3: 3.57 ms per loop

In [219]: %timeit A.mul(B)
100 loops, best of 3: 3.56 ms per loop
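Both spellings are equivalent here; mul earns its keep when the frames do not align perfectly, because it accepts a fill_value for entries missing on one side. A minimal sketch (fill_value=1 is just an illustrative choice):
C = A.mul(B, fill_value=1)  # a cell missing in A or B is treated as 1 before multiplying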
