merge groupby results directly back to dataframe

Suppose I have the following data:
df = pd.DataFrame(data = [[1,1,10],[1,2,20],[1,3,50],[2,1,15],[2,2,20],[2,3,30],[3,1,40],[3,2,70]],columns=['id1','id2','x'])
id1 id2 x
0 1 1 10
1 1 2 20
2 1 3 50
3 2 1 15
4 2 2 20
5 2 3 30
6 3 1 40
7 3 2 70
The dataframe is sorted along the two ids. Suppose I'd like to know the value of x of the FIRST observation within each group of id1 observations. The result would be like
id1 id2 x first_x
1 1 10 10
1 2 20 10
1 3 50 10
2 1 15 15
2 2 20 15
2 3 30 15
3 1 40 40
3 2 70 40
How do I achieve this 'subscripting'? Ideally, the new column would be filled for each observation.
I thought along the lines of
df['first_x'] = df.groupby(['id1'])[0]

I think the simplest option is transform with first:
df['first_x'] = df.groupby('id1')['x'].transform('first')
Or map using a Series created with drop_duplicates:
df['first_x'] = df['id1'].map(df.drop_duplicates('id1').set_index('id1')['x'])
print (df)
id1 id2 x first_x
0 1 1 10 10
1 1 2 20 10
2 1 3 50 10
3 2 1 15 15
4 2 2 20 15
5 2 3 30 15
6 3 1 40 40
7 3 2 70 40
The first approach (transform) is the shortest and, in the timings below, the fastest solution:
np.random.seed(123)
N = 1000000
L = list('abcde')
df = pd.DataFrame({'id1': np.random.randint(10000,size=N),
'x':np.random.randint(10000,size=N)})
df = df.sort_values('id1').reset_index(drop=True)
print (df)
In [179]: %timeit df.join(df.groupby(['id1'])['x'].first(), on='id1', how='left', lsuffix='', rsuffix='_first')
10 loops, best of 3: 125 ms per loop
In [180]: %%timeit
...: first_xs = df.groupby(['id1']).first().to_dict()['x']
...:
...: df['first_x'] = df['id1'].map(lambda id: first_xs[id])
...:
1 loop, best of 3: 524 ms per loop
In [181]: %timeit df['first_x'] = df.groupby('id1')['x'].transform('first')
10 loops, best of 3: 54.9 ms per loop
In [182]: %timeit df['first_x'] = df['id1'].map(df.drop_duplicates('id1').set_index('id1')['x'])
10 loops, best of 3: 142 ms per loop
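
As a quick sanity check, here is a minimal self-contained sketch that runs the transform and map approaches on the question's sample data and confirms they agree. One caveat worth knowing: the groupby 'first' aggregation takes the first non-null value per group, whereas the drop_duplicates version takes the first row as it is, so the two can differ if x contains NaN.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame(
    [[1, 1, 10], [1, 2, 20], [1, 3, 50], [2, 1, 15],
     [2, 2, 20], [2, 3, 30], [3, 1, 40], [3, 2, 70]],
    columns=['id1', 'id2', 'x'],
)

# Approach 1: broadcast the first x of each id1 group back to every row
via_transform = df.groupby('id1')['x'].transform('first')

# Approach 2: build an id1 -> first x lookup and map it onto id1
lookup = df.drop_duplicates('id1').set_index('id1')['x']
via_map = df['id1'].map(lookup)

# Both give the same values on this NaN-free data
same = via_transform.tolist() == via_map.tolist()

df['first_x'] = via_transform
print(df)
```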

Something like this?
df = pd.DataFrame(data = [[1,1,10],[1,2,20],[1,3,50],[2,1,15],[2,2,20],[2,3,30],[3,1,40],[3,2,70]],columns=['id1','id2','x'])
df = df.join(df.groupby(['id1'])['x'].first(), on='id1', how='left', lsuffix='', rsuffix='_first')

As you need to consider the entire dataframe when building values for each row, you need an intermediate step.
The following gets your first_x value using a group by, then uses that as a map to add a new column.
import pandas as pd
df = pd.DataFrame(data = [[1,1,10],[1,2,20],[1,3,50],[2,1,15],[2,2,20],[2,3,30],[3,1,40],[3,2,70]],columns=['id1','id2','x'])
first_xs = df.groupby(['id1']).first().to_dict()['x']
df['first_x'] = df['id1'].map(lambda id: first_xs[id])

Related

How to split a pandas df column that has objects? [duplicate]

This question already has answers here:
Split / Explode a column of dictionaries into separate columns with pandas
(13 answers)
Closed 4 years ago.
I have a really simple Pandas dataframe where each cell contains a list. I'd like to split each element of the list into its own column. I can do that by exporting the values and then creating a new dataframe. This doesn't seem like a good way to do it, especially if my dataframe had a column aside from the list column.
import pandas as pd
df = pd.DataFrame(data=[[[8,10,12]],
[[7,9,11]]])
df = pd.DataFrame(data=[x[0] for x in df.values])
Desired output:
0 1 2
0 8 10 12
1 7 9 11
Follow-up based on @Psidom's answer:
If I did have a second column:
df = pd.DataFrame(data=[[[8,10,12], 'A'],
[[7,9,11], 'B']])
How do I avoid losing the other column?
Desired output:
0 1 2 3
0 8 10 12 A
1 7 9 11 B

You can loop through the Series with the apply() function and convert each list to a Series; this automatically expands each list into columns:
df[0].apply(pd.Series)
# 0 1 2
#0 8 10 12
#1 7 9 11
Update: To keep other columns of the data frame, you can concatenate the result with the columns you want to keep:
pd.concat([df[0].apply(pd.Series), df[1]], axis = 1)
# 0 1 2 1
#0 8 10 12 A
#1 7 9 11 B

You could do pd.DataFrame(df[col].values.tolist()), which is much faster (roughly 250x in the timings below):
In [820]: pd.DataFrame(df[0].values.tolist())
Out[820]:
0 1 2
0 8 10 12
1 7 9 11
In [821]: pd.concat([pd.DataFrame(df[0].values.tolist()), df[1]], axis=1)
Out[821]:
0 1 2 1
0 8 10 12 A
1 7 9 11 B
Timings
Medium
In [828]: df.shape
Out[828]: (20000, 2)
In [829]: %timeit pd.DataFrame(df[0].values.tolist())
100 loops, best of 3: 15 ms per loop
In [830]: %timeit df[0].apply(pd.Series)
1 loop, best of 3: 4.06 s per loop
Large
In [832]: df.shape
Out[832]: (200000, 2)
In [833]: %timeit pd.DataFrame(df[0].values.tolist())
10 loops, best of 3: 161 ms per loop
In [834]: %timeit df[0].apply(pd.Series)
1 loop, best of 3: 40.9 s per loop
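
A minimal sketch of the tolist() approach that also keeps the other column. The column names 'values' and 'label' and the non-default index are assumptions made for illustration; they are not in the question. Passing index=df.index keeps the expanded columns aligned when the original index is not 0..n-1.

```python
import pandas as pd

# Hypothetical column names and a non-default index, for illustration only
df = pd.DataFrame(
    {'values': [[8, 10, 12], [7, 9, 11]], 'label': ['A', 'B']},
    index=[10, 20],
)

# Expand each list into its own columns, reusing the original index
expanded = pd.DataFrame(df['values'].tolist(), index=df.index)

# Attach the remaining column; join aligns on the shared index
result = expanded.join(df['label'])
print(result)
```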

Frequency of repetitive position in pandas data frame

Hi, I am trying to find repeated positions in the following data frame:
data = pd.DataFrame()
data ['league'] =['A','A','A','A','A','A','B','B','B']
data ['Team'] = ['X','X','X','Y','Y','Y','Z','Z','Z']
data ['week'] =[1,2,3,1,2,3,1,2,3]
data ['position']= [1,1,2,2,2,1,2,3,4]
I will compare each position with the previous row. If it is the same, I will assign 0. If it differs from the previous row, I will assign 1.
My expected outcome will be as follows:
It means I will group by (League, Team and week) and work out the frequency.
Can anyone advise how to do that in Pandas?
Thanks,
Zep

Use diff, and compare against 0:
v = df.position.diff()
v[0] = 0
df['frequency'] = v.ne(0).astype(int)
print(df)
league Team week position frequency
0 A X 1 1 0
1 A X 2 1 0
2 A X 3 2 1
3 A Y 1 2 0
4 A Y 2 2 0
5 A Y 3 1 1
6 B Z 1 2 1
7 B Z 2 3 1
8 B Z 3 4 1
For performance reasons, you should try to avoid a fillna call.
df = pd.concat([df] * 100000, ignore_index=True)
%timeit df['frequency'] = df['position'].diff().abs().fillna(0,downcast='infer')
%%timeit
v = df.position.diff()
v[0] = 0
df['frequency'] = v.ne(0).astype(int)
83.7 ms ± 1.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
10.9 ms ± 217 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
To extend this answer to work in a groupby, use
v = df.groupby(['league', 'Team', 'week']).position.diff()
v[np.isnan(v)] = 0
df['frequency'] = v.ne(0).astype(int)

Use diff and abs with fillna:
data['frequency'] = data['position'].diff().abs().fillna(0,downcast='infer')
print(data)
league Team week position frequency
0 A X 1 1 0
1 A X 2 1 0
2 A X 3 2 1
3 A Y 1 2 0
4 A Y 2 2 0
5 A Y 3 1 1
6 B Z 1 2 1
7 B Z 2 3 1
8 B Z 3 4 1
Grouping by league, Team and week gives all zeros, because each group then holds a single row and the comparison is made within groups rather than across the whole dataframe:
data.groupby(['league', 'Team', 'week'])['position'].diff().fillna(0,downcast='infer')
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
Name: position, dtype: int64
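
If the intent is to compare each row with the previous week of the same team, one option is to group by league and Team only. This is an assumption about what the asker wants, not something the question confirms; the sketch below shows how the result then differs from the whole-frame diff at the first row of each team.

```python
import pandas as pd

data = pd.DataFrame({
    'league': list('AAAAAABBB'),
    'Team': list('XXXYYYZZZ'),
    'week': [1, 2, 3, 1, 2, 3, 1, 2, 3],
    'position': [1, 1, 2, 2, 2, 1, 2, 3, 4],
})

# Difference from the previous row within each (league, Team) group;
# the first row of every group has no predecessor and comes back as NaN
step = data.groupby(['league', 'Team'])['position'].diff()

# Treat "no predecessor" as "no change", then flag any non-zero change
data['frequency'] = (step.fillna(0) != 0).astype(int)
print(data)
```

Row 6 (team Z, week 1) is 0 here because it is no longer compared with the last row of team Y.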

Is there any column match or row match function in python?

I have two data frames, let's say:
dataframe A with column 'name'
name
0 4
1 2
2 1
3 3
Another dataframe B with two columns i.e. name and value
name value
0 3 5
1 2 6
2 4 7
3 1 8
I want to rearrange the value in dataframe B according to the name column in dataframe A
I am expecting final dataframe similar to this:
name value
0 4 7
1 2 6
2 1 8
3 3 5

Here are two options:
dfB.set_index('name').loc[dfA.name].reset_index()
Out:
name value
0 4 7
1 2 6
2 1 8
3 3 5
Or,
dfA['value'] = dfA['name'].map(dfB.set_index('name')['value'])
dfA
Out:
name value
0 4 7
1 2 6
2 1 8
3 3 5
Timings:
import numpy as np
import pandas as pd
prng = np.random.RandomState(0)
names = np.arange(10**7)
prng.shuffle(names)
dfA = pd.DataFrame({'name': names})
prng.shuffle(names)
dfB = pd.DataFrame({'name': names, 'value': prng.randint(0, 100, 10**7)})
%timeit dfB.set_index('name').loc[dfA.name].reset_index()
1 loop, best of 3: 2.27 s per loop
%timeit dfA['value'] = dfA['name'].map(dfB.set_index('name')['value'])
1 loop, best of 3: 1.65 s per loop
%timeit dfB.set_index('name').ix[dfA.name].reset_index()
1 loop, best of 3: 1.66 s per loop
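
A third option, sketched here on the question's small example, is a left merge, which keeps the key order of the left frame. The names dfA and dfB follow the answer above; the timings do not cover this variant, so no speed claim is made for it.

```python
import pandas as pd

dfA = pd.DataFrame({'name': [4, 2, 1, 3]})
dfB = pd.DataFrame({'name': [3, 2, 4, 1], 'value': [5, 6, 7, 8]})

# how='left' preserves the row order of dfA and pulls in the matching value
ordered = dfA.merge(dfB, on='name', how='left')
print(ordered)
```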

python - possible to apply percentile cuts to each column in a dataframe?

Is it possible to put percentile cuts on all columns of a dataframe without using a loop? This is how I am doing it now:
df = pd.DataFrame(np.random.randn(10,5))
df_q = pd.DataFrame()
for i in list(range(len(df.columns))):
    df_q[i] = pd.qcut(df[i], 5, labels=list(range(5)))
I am hoping there is a slick pandas solution for this to avoid the use of a loop.
Thanks!

pd.qcut accepts a 1D array or Series as its argument. To apply pd.qcut to every column requires multiple calls to pd.qcut. So no matter how you dress it up, there will be a loop -- either explicit or implicit.
You could for example, use apply to call pd.qcut for each column:
In [46]: df.apply(lambda x: pd.qcut(x, 5, labels=list(range(5))), axis=0)
Out[46]:
0 1 2 3 4
0 4 0 3 0 3
1 0 0 2 3 0
2 3 4 1 2 3
3 4 1 1 1 4
4 3 2 2 4 1
5 2 4 3 0 1
6 2 3 0 4 4
7 1 3 4 2 2
8 0 1 4 3 0
9 1 2 0 1 2
but under the hood, df.apply is using a for-loop, so it really isn't very different than your for-loop:
df_q = pd.DataFrame()
for col in df:
    df_q[col] = pd.qcut(df[col], 5, labels=list(range(5)))
In [47]: %timeit df.apply(lambda x: pd.qcut(x, 5, labels=list(range(5))), axis=0)
100 loops, best of 3: 2.9 ms per loop
In [48]: %%timeit
df_q = pd.DataFrame()
for col in df:
    df_q[col] = pd.qcut(df[col], 5, labels=list(range(5)))
100 loops, best of 3: 2.95 ms per loop
Note that
for i in list(range(len(df.columns))):
will only work if the columns of df happen to be sequential integers starting at 0.
It is more robust to use
for col in df:
to iterate over the columns of the DataFrame.
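
For reference, a self-contained sketch of the apply variant with non-integer column names, where the range(len(df.columns)) loop would fail. The column names and the seed are assumptions for illustration. It uses labels=False, which makes pd.qcut return integer bin codes instead of a categorical.

```python
import numpy as np
import pandas as pd

# Hypothetical column names; seeded so the example is reproducible
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((10, 5)), columns=list('abcde'))

# Quintile code (0..4) for every value, computed column by column
df_q = df.apply(lambda col: pd.qcut(col, 5, labels=False))
print(df_q)
```

With 10 distinct values per column and 5 quantiles, each bin ends up with exactly two rows.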

pandas merge and fill a dataframe with summary data

Supposing I have a data frame as follows:
frameA = pd.DataFrame(dict(title=['a','a','a','b','b','b'],value=[1,2,3,4,5,6]))
frameB = pd.DataFrame(dict(title=['a','b'],value=[10,20]))
frameA looks like
title value
0 a 1
1 a 2
2 a 3
3 b 4
4 b 5
5 b 6
and frameB looks like
title value
0 a 10
1 b 20
I'd like to do some kind of merge or join so that I get
title value value2
a 1 10
a 2 10
a 3 10
b 4 20
b 5 20
b 6 20
I tried
pd.concat([frameA,frameB],axis=1)
and frameA.merge(frameB)
and frameA.apply(lambda x: frameB[x.title])
None of which work. I'm sure there is a really obvious way but I just can't seem to find it at the moment. Thanks
========================================
Right after I posted this I came across "Merging pandas dataframes using date as index", which seems to show one way. Are there any others?

Another way of merging:
frameA.merge(frameB,on ='title', how ='left')
title value_x value_y
0 a 1 10
1 a 2 10
2 a 3 10
3 b 4 20
4 b 5 20
5 b 6 20

What you want is a left join.
http://pandas.pydata.org/pandas-docs/dev/merging.html
pd.merge(frameA,frameB,on='title',how='left')
Out:
title value_x value_y
0 a 1 10
1 a 2 10
2 a 3 10
3 b 4 20
4 b 5 20
5 b 6 20

A faster method that doesn't involve renaming or dropping columns is to set the index of frameB to title and call map on frameA['title'], passing frameB['value'] as a Series. This performs a lookup using the title values and returns the values that match:
In [85]:
frameB.set_index('title', inplace=True)
frameA['value2'] = frameA['title'].map(frameB['value'])
frameA
Out[85]:
title value value2
0 a 1 10
1 a 2 10
2 a 3 10
3 b 4 20
4 b 5 20
5 b 6 20
If we compare the performance of merge against map, we can see that map is nearly 5x faster:
In [70]:
%timeit pd.merge(frameA,frameB,on='title',how='left')
1000 loops, best of 3: 1.42 ms per loop
In [83]:
frameB.set_index('title', inplace=True)
%timeit frameA['value2'] = frameA['title'].map(frameB['value'])
1000 loops, best of 3: 286 µs per loop
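
If you prefer merge but want the value2 column name from the question rather than value_x and value_y, one way is to rename the column on frameB before merging. This is a minimal sketch; the validate='many_to_one' argument is an optional extra that raises if frameB has duplicate titles.

```python
import pandas as pd

frameA = pd.DataFrame(dict(title=['a', 'a', 'a', 'b', 'b', 'b'],
                           value=[1, 2, 3, 4, 5, 6]))
frameB = pd.DataFrame(dict(title=['a', 'b'], value=[10, 20]))

# Rename first so the merged column is called value2 directly
merged = frameA.merge(
    frameB.rename(columns={'value': 'value2'}),
    on='title',
    how='left',
    validate='many_to_one',  # each title may appear only once in frameB
)
print(merged)
```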
