Frequency of repetitive position in pandas data frame - python

Hi, I am trying to find the repetitive positions in the following data frame:
import pandas as pd

data = pd.DataFrame()
data['league'] = ['A','A','A','A','A','A','B','B','B']
data['Team'] = ['X','X','X','Y','Y','Y','Z','Z','Z']
data['week'] = [1,2,3,1,2,3,1,2,3]
data['position'] = [1,1,2,2,2,1,2,3,4]
I want to compare each position with the one in the previous row: if it is the same, I will assign 0; if it is different from the previous row, I will assign 1. My expected outcome will group by (league, Team and week) and work out this frequency.
Can anyone advise how to do that in pandas?
Thanks,
Zep

Use diff, and compare against 0:
v = df.position.diff()
v[0] = 0
df['frequency'] = v.ne(0).astype(int)
print(df)
league Team week position frequency
0 A X 1 1 0
1 A X 2 1 0
2 A X 3 2 1
3 A Y 1 2 0
4 A Y 2 2 0
5 A Y 3 1 1
6 B Z 1 2 1
7 B Z 2 3 1
8 B Z 3 4 1
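One caveat, noted as an aside: v[0] = 0 assigns by label, so it relies on the frame having the default RangeIndex. With an arbitrary index, positional assignment via iloc is the safer sketch:
v = df.position.diff()
v.iloc[0] = 0  # positional: always the first row, whatever the index labels are
df['frequency'] = v.ne(0).astype(int)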
For performance reasons, you should try to avoid a fillna call.
df = pd.concat([df] * 100000, ignore_index=True)

%timeit df['frequency'] = df['position'].diff().abs().fillna(0, downcast='infer')
83.7 ms ± 1.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
v = df.position.diff()
v[0] = 0
df['frequency'] = v.ne(0).astype(int)
10.9 ms ± 217 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
To extend this answer to work in a groupby, use the same idea; zeroing the NaNs matters because each group's first diff is NaN, and NaN != 0 would otherwise flag it as a change:
import numpy as np

v = df.groupby(['league', 'Team', 'week']).position.diff()
v[np.isnan(v)] = 0
df['frequency'] = v.ne(0).astype(int)

Use diff and abs with fillna (abs is enough here because consecutive positions differ by at most 1; for bigger jumps, compare against 0 as above):
data['frequency'] = data['position'].diff().abs().fillna(0, downcast='infer')
print(data)
league Team week position frequency
0 A X 1 1 0
1 A X 2 1 0
2 A X 3 2 1
3 A Y 1 2 0
4 A Y 2 2 0
5 A Y 3 1 1
6 B Z 1 2 1
7 B Z 2 3 1
8 B Z 3 4 1
Using groupby on all three keys gives all zeros, because each (league, Team, week) combination is a single row, so every within-group diff is NaN:
data.groupby(['league', 'Team', 'week'])['position'].diff().fillna(0, downcast='infer')
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
Name: position, dtype: int64
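If the intent was instead to compare positions within each team across weeks (an assumption about what the questioner wants, not something the answers state), grouping by league and Team only gives a per-team diff. Note that each team's first week then counts as "no change", so row 6 becomes 0 rather than the 1 shown above:
v = data.groupby(['league', 'Team'])['position'].diff()
v[v.isna()] = 0  # each team's first week has no previous week to compare against
data['frequency'] = v.ne(0).astype(int)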

Related

Create pandas subtraction column based on one other column in two conditions

I have a dataframe with subjects in two different conditions and many value columns.
d = {
"subject": [1, 1, 2, 2],
"condition": ["on", "off", "on", "off"],
"value": [1, 2, 3, 5]
}
df = pd.DataFrame(data=d)
df
   subject condition  value
0        1        on      1
1        1       off      2
2        2        on      3
3        2       off      5
I would like to get new columns which indicate the difference off-on between both conditions. In this case I would like to get:
   subject condition  value  off-on
0        1        on      1       1
1        1       off      2       1
2        2        on      3       2
3        2       off      5       2
How would I best do that?
I could achieve the result using this code:
onoff = (df[df.condition == "off"].value.reset_index() - df[df.condition == "on"].value.reset_index()).value
for idx, sub in enumerate(df.subject.unique()):
    df.loc[df.subject == sub, "off-on"] = onoff.iloc[idx]
But it seems quite tedious and slow. I was hoping for a solution without a loop, since I have many rows and very many value columns. Is there a better way?
Use a pivot combined with map:
df['off-on'] = df['subject'].map(
    df.pivot(index='subject', columns='condition', values='value')
      .eval('off-on')
)
Or with a MultiIndex (more efficient than a pivot):
s = df.set_index(['condition', 'subject'])['value']
df['off-on'] = df['subject'].map(s['off']-s['on'])
Output:
subject condition value off-on
0 1 on 1 1
1 1 off 2 1
2 2 on 3 2
3 2 off 5 2
timings
On 100k subjects
# MultiIndexing
43.2 ms ± 2.46 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# pivot
77 ms ± 12.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
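The question mentions "very many value columns". Assuming exactly one on and one off row per subject, the pivot approach extends to several columns at once; a sketch, where v1 and v2 are hypothetical column names:
value_cols = ['v1', 'v2']  # hypothetical: list all your value columns here
wide = df.pivot(index='subject', columns='condition', values=value_cols)
# select the 'off' and 'on' slices of the column MultiIndex and subtract
diff = wide.xs('off', axis=1, level='condition') - wide.xs('on', axis=1, level='condition')
for col in value_cols:
    df[f'{col}_off-on'] = df['subject'].map(diff[col])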
Use DataFrame.pivot, then subtract the off and on columns and map the result back with Series.map:
df1 = df.pivot(index='subject', columns='condition', values='value')
df['off-on'] = df['subject'].map(df1['off'].sub(df1['on']))
print (df)
subject condition value off-on
0 1 on 1 1
1 1 off 2 1
2 2 on 3 2
3 2 off 5 2
Details:
print (df.pivot(index='subject', columns='condition', values='value'))
condition off on
subject
1 2 1
2 5 3
print (df1['off'].sub(df1['on']))
subject
1 1
2 2
dtype: int64

merge groupby results directly back to dataframe

Suppose I have the following data:
df = pd.DataFrame(data = [[1,1,10],[1,2,20],[1,3,50],[2,1,15],[2,2,20],[2,3,30],[3,1,40],[3,2,70]],columns=['id1','id2','x'])
id1 id2 x
0 1 1 10
1 1 2 20
2 1 3 50
3 2 1 15
4 2 2 20
5 2 3 30
6 3 1 40
7 3 2 70
The dataframe is sorted along the two ids. Suppose I'd like to know the value of x of the FIRST observation within each group of id1 observations. The result would be like
id1 id2 x first_x
1 1 10 10
1 2 20 10
1 3 50 10
2 1 15 15
2 2 20 15
2 3 30 15
3 1 40 40
3 2 70 40
How do I achieve this 'subscripting'? Ideally, the new column would be filled for each observation.
I thought along the lines of
df['first_x'] = df.groupby(['id1'])[0]
I think simplest is transform with first:
df['first_x'] = df.groupby('id1')['x'].transform('first')
Or map with a Series created by drop_duplicates:
df['first_x'] = df['id1'].map(df.drop_duplicates('id1').set_index('id1')['x'])
print (df)
id1 id2 x first_x
0 1 1 10 10
1 1 2 20 10
2 1 3 50 10
3 2 1 15 15
4 2 2 20 15
5 2 3 30 15
6 3 1 40 40
7 3 2 70 40
The first is the shortest and fastest solution:
import numpy as np
import pandas as pd

np.random.seed(123)
N = 1000000
df = pd.DataFrame({'id1': np.random.randint(10000, size=N),
                   'x': np.random.randint(10000, size=N)})
df = df.sort_values('id1').reset_index(drop=True)
print (df)
In [179]: %timeit df.join(df.groupby(['id1'])['x'].first(), on='id1', how='left', lsuffix='', rsuffix='_first')
10 loops, best of 3: 125 ms per loop
In [180]: %%timeit
...: first_xs = df.groupby(['id1']).first().to_dict()['x']
...:
...: df['first_x'] = df['id1'].map(lambda id: first_xs[id])
...:
1 loop, best of 3: 524 ms per loop
In [181]: %timeit df['first_x'] = df.groupby('id1')['x'].transform('first')
10 loops, best of 3: 54.9 ms per loop
In [182]: %timeit df['first_x'] = df['id1'].map(df.drop_duplicates('id1').set_index('id1')['x'])
10 loops, best of 3: 142 ms per loop
Something like this?
df = pd.DataFrame(data = [[1,1,10],[1,2,20],[1,3,50],[2,1,15],[2,2,20],[2,3,30],[3,1,40],[3,2,70]],columns=['id1','id2','x'])
df = df.join(df.groupby(['id1'])['x'].first(), on='id1', how='left', lsuffix='', rsuffix='_first')
As you need to consider the entire dataframe when building values for each row, you need an intermediate step.
The following gets your first_x value using a group by, then uses that as a map to add a new column.
import pandas as pd
df = pd.DataFrame(data = [[1,1,10],[1,2,20],[1,3,50],[2,1,15],[2,2,20],[2,3,30],[3,1,40],[3,2,70]],columns=['id1','id2','x'])
first_xs = df.groupby(['id1']).first().to_dict()['x']
df['first_x'] = df['id1'].map(lambda id: first_xs[id])
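Since the question states the frame is sorted by the ids, a duplicated/ffill sketch would also work; this is an alternative under that sorting assumption, not one of the answers above:
first_row = ~df['id1'].duplicated()  # True at each id1 group's first row
df['first_x'] = df['x'].where(first_row).ffill().astype(df['x'].dtype)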

Is there any column match or row match function in python?

I have two data frames, let's say:
dataframe A with a column 'name':
name
0 4
1 2
2 1
3 3
Another dataframe B with two columns, i.e. name and value:
name value
0 3 5
1 2 6
2 4 7
3 1 8
I want to rearrange the values in dataframe B according to the name column in dataframe A.
I am expecting a final dataframe similar to this:
name value
0 4 7
1 2 6
2 1 8
3 3 5
Here are two options:
dfB.set_index('name').loc[dfA.name].reset_index()
Out:
name value
0 4 7
1 2 6
2 1 8
3 3 5
Or,
dfA['value'] = dfA['name'].map(dfB.set_index('name')['value'])
dfA
Out:
name value
0 4 7
1 2 6
2 1 8
3 3 5
Timings:
import numpy as np
import pandas as pd
prng = np.random.RandomState(0)
names = np.arange(10**7)
prng.shuffle(names)
dfA = pd.DataFrame({'name': names})
prng.shuffle(names)
dfB = pd.DataFrame({'name': names, 'value': prng.randint(0, 100, 10**7)})
%timeit dfB.set_index('name').loc[dfA.name].reset_index()
1 loop, best of 3: 2.27 s per loop
%timeit dfA['value'] = dfA['name'].map(dfB.set_index('name')['value'])
1 loop, best of 3: 1.65 s per loop
%timeit dfB.set_index('name').ix[dfA.name].reset_index()
1 loop, best of 3: 1.66 s per loop
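A plain left merge would be a third option, since how='left' preserves the left frame's row order; a minimal sketch (not timed above):
dfC = dfA.merge(dfB, on='name', how='left')  # rows come back in dfA's order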

Removing rows in Pandas based on multiple columns

In Pandas, I have a dataframe with ZipCode, Age, and a bunch of columns that should all have values 1 or 0, ie:
ZipCode Age A B C D
12345 21 0 1 1 1
12345 22 1 0 1 4
23456 45 1 0 1 1
23456 21 3 1 0 0
I want to delete all rows in which a value other than 0 or 1 appears in columns A, B, C, or D, as a way to clean up the data. In this case I would remove the 2nd and 4th rows, because 4 appears in column D of row 2 and 3 appears in column A of row 4. I want this to work even if I have 100 columns to check, so that I don't have to list every column in my conditional statement. How would I do this?
Use isin to test for membership and all to check that every row value passes, then use this boolean mask to filter the df:
In [12]:
df[df.ix[:,'A':].isin([0,1]).all(axis=1)]
Out[12]:
ZipCode Age A B C D
0 12345 21 0 1 1 1
2 23456 45 1 0 1 1
You can opt for a vectorized solution:
In [64]: df[df[['A','B','C','D']].isin([0,1]).sum(axis=1)==4]
Out[64]:
ZipCode Age A B C D
0 12345 21 0 1 1 1
2 23456 45 1 0 1 1
The other two solutions work well, but if you are interested in speed you should look at NumPy's in1d function:
data = df.loc[:, 'A':]
In [72]: df[np.in1d(data.values,[0,1]).reshape(data.shape).all(axis=1)]
Out[72]:
ZipCode Age A B C D
0 12345 21 0 1 1 1
2 23456 45 1 0 1 1
Timing:
In [73]: %timeit data=df.loc[:, 'A':]; df[np.in1d(data.values,[0,1]).reshape(data.shape).all(axis=1)]
1000 loops, best of 3: 558 µs per loop
In [74]: %timeit df[df.ix[:,'A':].isin([0,1]).all(axis=1)]
1000 loops, best of 3: 843 µs per loop
In [75]: %timeit df[df[['A','B','C','D']].isin([0,1]).sum(axis=1)==4]
1000 loops, best of 3: 1.44 ms per loop
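For what it's worth, in current NumPy and pandas the same idea reads a little simpler: np.isin (the successor to in1d) preserves the input's shape, so the reshape is unnecessary, and df.loc replaces the long-removed .ix. A sketch:
data = df.loc[:, 'A':]
df[np.isin(data.to_numpy(), [0, 1]).all(axis=1)]  # isin keeps the 2D shape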

Multiply two Pandas dataframes with same shape and same columns names

I have two dataframes A and B with the same NxM shape. I want to multiply them such that each element of A is multiplied by the corresponding element of B.
e.g:
A,B = input dataframes
C = final dataframe
I want C[i][j] = A[i][j]*B[i][j] for i=1..N and j=1..M
I searched but couldn't find exactly this solution.
I think you can use:
C = A * B
Next solution is with mul:
C = A.mul(B)
Sample:
print A
a b
0 1 3
1 2 4
2 3 7
print B
a b
0 2 3
1 1 4
2 3 2
print A * B
a b
0 2 9
1 2 16
2 9 14
print A.mul(B)
a b
0 2 9
1 2 16
2 9 14
Timings with length of A and B 300k:
A = pd.concat([A]*100000).reset_index(drop=True)
B = pd.concat([B]*100000).reset_index(drop=True)

In [218]: %timeit A * B
The slowest run took 4.27 times longer than the fastest. This could mean that an intermediate result is being cached
100 loops, best of 3: 3.57 ms per loop

In [219]: %timeit A.mul(B)
100 loops, best of 3: 3.56 ms per loop
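Both spellings are equivalent here; mul earns its keep when the frames do not align perfectly, because it accepts a fill_value for entries missing on one side. A minimal sketch (fill_value=1 is just an illustrative choice):
C = A.mul(B, fill_value=1)  # a cell missing in A or B is treated as 1 before multiplying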
