pandas merge and fill a dataframe with summary data

Suppose I have two data frames as follows:
frameA = pd.DataFrame(dict(title=['a','a','a','b','b','b'],value=[1,2,3,4,5,6]))
frameB = pd.DataFrame(dict(title=['a','b'],value=[10,20]))
frameA looks like
title value
0 a 1
1 a 2
2 a 3
3 b 4
4 b 5
5 b 6
and frameB looks like
title value
0 a 10
1 b 20
I'd like to do some kind of merge or join so that I get
title value value2
a 1 10
a 2 10
a 3 10
b 4 20
b 5 20
b 6 20
I tried
pd.concat([frameA,frameB],axis=1)
and frameA.merge(frameB)
and frameA.apply(lambda x: frameB[x.title])
None of these work. I'm sure there is a really obvious way, but I just can't seem to find it at the moment. Thanks
========================================
And right after I posted this I came across "Merging pandas dataframes using date as index", which seems to show one way. Are there any others?

Another way of merging:
frameA.merge(frameB,on ='title', how ='left')
title value_x value_y
0 a 1 10
1 a 2 10
2 a 3 10
3 b 4 20
4 b 5 20
5 b 6 20

What you want is a left join.
http://pandas.pydata.org/pandas-docs/dev/merging.html
pd.merge(frameA,frameB,on='title',how='left')
Out:
title value_x value_y
0 a 1 10
1 a 2 10
2 a 3 10
3 b 4 20
4 b 5 20
5 b 6 20
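If you want the value2 column name from the question rather than the value_x/value_y suffixes, one option is to rename frameB's column before the left join. This is a minimal sketch using the frames from the question; the name result is only illustrative.

```python
import pandas as pd

frameA = pd.DataFrame({"title": ["a", "a", "a", "b", "b", "b"],
                       "value": [1, 2, 3, 4, 5, 6]})
frameB = pd.DataFrame({"title": ["a", "b"], "value": [10, 20]})

# Rename the summary column first so the merge does not need suffixes.
result = frameA.merge(
    frameB.rename(columns={"value": "value2"}),
    on="title",
    how="left",
)
print(result)
```

Because only title is shared after the rename, the merged frame has the columns title, value and value2.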

A faster method that doesn't involve renaming or dropping columns is to set the index of frameB to title and call map on frameA['title'], passing in the frameB['value'] series. This performs a lookup using the title values and returns the matching values:
In [85]:
frameB.set_index('title', inplace=True)
frameA['value2'] = frameA['title'].map(frameB['value'])
frameA
Out[85]:
title value value2
0 a 1 10
1 a 2 10
2 a 3 10
3 b 4 20
4 b 5 20
5 b 6 20
If we compare the performance of merge against map, we can see that map is nearly 5x faster on this data:
In [70]:
%timeit pd.merge(frameA,frameB,on='title',how='left')
1000 loops, best of 3: 1.42 ms per loop
In [83]:
frameB.set_index('title', inplace=True)
%timeit frameA['value2'] = frameA['title'].map(frameB['value'])
1000 loops, best of 3: 286 µs per loop
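One thing to keep in mind with the map approach: titles that have no match in frameB come back as NaN, and the new column becomes float. Here is a small sketch on made-up data; the extra title 'c' and the name lookup are assumptions for illustration only.

```python
import pandas as pd

frameA = pd.DataFrame({"title": ["a", "b", "c"], "value": [1, 2, 3]})
frameB = pd.DataFrame({"title": ["a", "b"], "value": [10, 20]})

# Build the lookup Series without modifying frameB in place.
lookup = frameB.set_index("title")["value"]

# 'c' is not in the lookup, so its value2 is NaN.
frameA["value2"] = frameA["title"].map(lookup)
print(frameA)
```

A left merge behaves the same way for unmatched keys, so this is not a reason to prefer one over the other, just something to check for afterwards.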

Related

I want to add sub-index in python with pandas [duplicate]

When using groupby(), how can I create a DataFrame with a new column containing the index of the group, similar to dplyr::group_indices in R? For example, if I have
>>> df=pd.DataFrame({'a':[1,1,1,2,2,2],'b':[1,1,2,1,1,2]})
>>> df
a b
0 1 1
1 1 1
2 1 2
3 2 1
4 2 1
5 2 2
How can I get a DataFrame like
a b idx
0 1 1 1
1 1 1 1
2 1 2 2
3 2 1 3
4 2 1 3
5 2 2 4
(the order of the idx indexes doesn't matter)

Here is the solution using ngroup (available as of pandas 0.20.2) from a comment above by Constantino.
import pandas as pd
df = pd.DataFrame({'a':[1,1,1,2,2,2],'b':[1,1,2,1,1,2]})
df['idx'] = df.groupby(['a', 'b']).ngroup()
df
a b idx
0 1 1 0
1 1 1 0
2 1 2 1
3 2 1 2
4 2 1 2
5 2 2 3
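ngroup numbers the groups from 0. If you want the 1-based numbering shown in the question, adding 1 should be enough. A minimal sketch on the question's data:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 1, 2, 2, 2], "b": [1, 1, 2, 1, 1, 2]})

# ngroup() is 0-based; shift by one to match the requested output.
df["idx"] = df.groupby(["a", "b"]).ngroup() + 1
print(df)
```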

Here's a concise way using drop_duplicates and merge to get a unique identifier.
group_vars = ['a','b']
df.merge( df.drop_duplicates( group_vars ).reset_index(), on=group_vars )
a b index
0 1 1 0
1 1 1 0
2 1 2 2
3 2 1 3
4 2 1 3
5 2 2 5
The identifier in this case goes 0,2,3,5 (just a residual of original index) but this could be easily changed to 0,1,2,3 with an additional reset_index(drop=True).
Update: Newer versions of pandas (0.20.2) offer a simpler way to do this with the ngroup method, as noted in a comment to the question above by @Constantino and a subsequent answer by @CalumYou. I'll leave this here as an alternate approach, but ngroup seems like the better way to do this in most cases.
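The reset_index(drop=True) variant mentioned above could look roughly like this. It is a sketch, not the answer's exact code; the names keys and out and the column name idx are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 1, 2, 2, 2], "b": [1, 1, 2, 1, 1, 2]})
group_vars = ["a", "b"]

# One row per distinct (a, b) pair, renumbered 0..n-1, with that
# number exposed as a regular column called "idx".
keys = (
    df.drop_duplicates(group_vars)[group_vars]
    .reset_index(drop=True)
    .reset_index()
    .rename(columns={"index": "idx"})
)

# Attach the consecutive identifier to every original row.
out = df.merge(keys, on=group_vars, how="left")
print(out)
```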

A simple way to do that would be to concatenate your grouping columns (so that each combination of their values becomes a distinct string), then convert the result to a pandas Categorical and keep only its codes:
df['idx'] = pd.Categorical(df['a'].astype(str) + '_' + df['b'].astype(str)).codes
df
a b idx
0 1 1 0
1 1 1 0
2 1 2 1
3 2 1 2
4 2 1 2
5 2 2 3
Edit: changed the labels property to codes, as the former seems to be deprecated
Edit2: Added a separator as suggested by Authman Apatira
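To see why the separator matters, consider values where plain concatenation is ambiguous. This sketch uses made-up data (not from the question) in which a=1, b=11 and a=11, b=1 both concatenate to "111":

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 11], "b": [11, 1]})

# Without a separator both rows become the string "111" and collide.
no_sep = pd.Categorical(df["a"].astype(str) + df["b"].astype(str)).codes

# With a separator the rows become "1_11" and "11_1" and stay distinct.
with_sep = pd.Categorical(
    df["a"].astype(str) + "_" + df["b"].astype(str)
).codes

print(no_sep, with_sep)
```

The separator only helps if it cannot appear inside the values themselves, which is an assumption you would need to check for string columns.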

Definitely not the most straightforward solution, but here is what I would do (comments in the code):
df=pd.DataFrame({'a':[1,1,1,2,2,2],'b':[1,1,2,1,1,2]})
# create a dummy grouper id by just joining the desired columns
df["idx"] = df[["a","b"]].astype(str).apply(lambda x: "".join(x),axis=1)
print(df)
That would generate a unique idx for each combination of a and b.
a b idx
0 1 1 11
1 1 1 11
2 1 2 12
3 2 1 21
4 2 1 21
5 2 2 22
But this is still a rather silly index (think about some more complex values in columns a and b). So let's clean up the index:
# create a dictionary of dummy group_ids and their index-wise representation
dict_idx = dict(enumerate(set(df["idx"])))
# switch keys and values, so you can use the dict as a lookup
dict_idx = {y:x for x,y in dict_idx.items()}
# replace the values using the generated dict
df["idx"] = df["idx"].map(dict_idx)
print(df)
That would produce output like the following (the exact numbering may differ, since set order is arbitrary):
a b idx
0 1 1 0
1 1 1 0
2 1 2 1
3 2 1 2
4 2 1 2
5 2 2 3

A way that I believe is faster than the current accepted answer by about an order of magnitude (timing results below):
def create_index_usingduplicated(df, grouping_cols=['a', 'b']):
    df.sort_values(grouping_cols, inplace=True)
    # You could do the following three lines in one, I just thought
    # this would be clearer as an explanation of what's going on:
    duplicated = df.duplicated(subset=grouping_cols, keep='first')
    new_group = ~duplicated
    return new_group.cumsum()
Timing results:
a = np.random.randint(0, 1000, size=int(1e5))
b = np.random.randint(0, 1000, size=int(1e5))
df = pd.DataFrame({'a': a, 'b': b})
In [6]: %timeit df['idx'] = pd.Categorical(df['a'].astype(str) + df['b'].astype(str)).codes
1 loop, best of 3: 375 ms per loop
In [7]: %timeit df['idx'] = create_index_usingduplicated(df, grouping_cols=['a', 'b'])
100 loops, best of 3: 17.7 ms per loop
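Note that the function above sorts df in place. If that side effect is unwanted, a non-mutating variant of the same duplicated/cumsum idea might look like this. It is a sketch; the function name group_index_via_duplicated and the shuffled sample data are made up for illustration.

```python
import pandas as pd


def group_index_via_duplicated(df, grouping_cols):
    """Return a 1-based group id per row without modifying df."""
    # Work on a sorted copy so equal key combinations are adjacent.
    ordered = df.sort_values(grouping_cols)
    # The first row of each key combination starts a new group.
    is_new_group = ~ordered.duplicated(subset=grouping_cols, keep="first")
    # Running count of group starts, put back in the original row order.
    return is_new_group.cumsum().reindex(df.index)


df = pd.DataFrame({"a": [2, 1, 1, 2, 1, 2], "b": [1, 1, 2, 2, 1, 1]})
df["idx"] = group_index_via_duplicated(df, ["a", "b"])
print(df)
```

Here (1, 1) gets id 1, (1, 2) gets 2, (2, 1) gets 3 and (2, 2) gets 4, and the rows of df keep their original order.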

I'm not sure this is such a trivial problem. Here is a somewhat convoluted solution that first sorts the grouping columns and then checks whether each row is different than the previous row and if so accumulates by 1. Check further below for an answer with string data.
df.sort_values(['a', 'b']).diff().fillna(0).ne(0).any(axis=1).cumsum().add(1)
Output
0 1
1 1
2 2
3 3
4 3
5 4
dtype: int64
So, breaking this up into steps, let's look at the output of df.sort_values(['a', 'b']).diff().fillna(0), which checks whether each row differs from the previous row. Any non-zero entry indicates a new group.
a b
0 0.0 0.0
1 0.0 0.0
2 0.0 1.0
3 1.0 -1.0
4 0.0 0.0
5 0.0 1.0
A new group only needs a single column to differ, so this is what .ne(0).any(axis=1) checks: not equal to 0 for any of the columns (older pandas also accepted the positional form .any(1)). After that, a cumulative sum keeps track of the groups.
Answer for columns as strings
#create fake data and sort it
df=pd.DataFrame({'a':list('aabbaccdc'),'b':list('aabaacddd')})
df1 = df.sort_values(['a', 'b'])
output of df1
a b
0 a a
1 a a
4 a a
3 b a
2 b b
5 c c
6 c d
8 c d
7 d d
Take similar approach by checking if group has changed
df1.ne(df1.shift().bfill()).any(axis=1).cumsum().add(1)
0 1
1 1
4 1
3 2
2 3
5 4
6 5
8 5
7 6

Sum of count where values are less than row

I'm using pandas to create a new column that, for each row, searches through the entire column (values range from 1 to 100) and counts the values that are less than the current row's value.
See [df] example below:
[A][NewCol]
1 0
3 2
2 1
5 4
8 5
3 2
Essentially, for each row I need to look at the entire Column A, and count how many values are less than the current row. So for Value 5, there are 4 values that are less (<) than 5 (1,2,3,3).
What would be the easiest way of doing this?
Thanks!

One way to do it is to use rank with method='min':
df['NewCol'] = (df['A'].rank(method='min') - 1).astype(int)
Output:
A NewCol
0 1 0
1 3 2
2 2 1
3 5 4
4 8 5
5 3 2
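The reason this works: with method='min', tied values all receive the lowest rank in their tie, so rank minus 1 equals the number of strictly smaller values. A quick sketch that checks this against a brute-force count on the question's data (the names by_rank and brute_force are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 3, 2, 5, 8, 3]})

# Rank-based count of strictly smaller values.
by_rank = (df["A"].rank(method="min") - 1).astype(int)

# Brute-force count for comparison: one full column scan per row.
brute_force = [int((df["A"] < v).sum()) for v in df["A"]]

print(by_rank.tolist())
print(brute_force)
```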

I am using NumPy broadcasting:
s=df.A.values
(s[:,None]>s).sum(1)
Out[649]: array([0, 2, 1, 4, 5, 2])
#df['NewCol']=(s[:,None]>s).sum(1)
timing
df=pd.concat([df]*1000)
%%timeit
s=df.A.values
(s[:,None]>s).sum(1)
10 loops, best of 3: 83.7 ms per loop
%timeit (df['A'].rank(method='min') - 1).astype(int)
1000 loops, best of 3: 479 µs per loop

Try this code:
A = [1, 3, 2, 5, 8, 3]  # your numbers
less_than = []
for element in A:
    counter = 0
    for number in A:
        if number < element:
            counter += 1
    less_than.append(counter)

You can do it this way:
import pandas as pd
df = pd.DataFrame({'A': [1,3,2,5,8,3]})
df['NewCol'] = 0
for idx, row in df.iterrows():
    df.loc[idx, 'NewCol'] = (df.loc[:, 'A'] < row.A).sum()
print(df)
A NewCol
0 1 0
1 3 2
2 2 1
3 5 4
4 8 5
5 3 2

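Another way is to sort and reset the index (note that tied values get different positions, so this matches the requested count only when the values are unique):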
m=df.A.sort_values().reset_index(drop=True).reset_index()
m.columns=['new','A']
print(m)
new A
0 0 1
1 1 2
2 2 3
3 3 3
4 4 5
5 5 8

You didn't specify whether speed or memory usage is important (or whether you have a very large dataset). The "easiest" way to do it is straightforward: calculate how many values are less than i for each entry in the column and collect those counts into a new column:
df=pd.DataFrame({'A': [1,3,2,5,8,3]})
col=df['A']
df['new_col']=[ sum(col<i) for i in col ]
print(df)
Result:
A new_col
0 1 0
1 3 2
2 2 1
3 5 4
4 8 5
5 3 2
There might be more efficient ways to do this on large datasets, such as sorting your column first.
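One possible version of the "sort first" idea uses np.searchsorted, which is a different technique from the list comprehension above: after sorting once, the left insertion point of each value in the sorted array is the number of strictly smaller values. A minimal sketch on the same data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 3, 2, 5, 8, 3]})

values = df["A"].to_numpy()

# side="left" returns the position before any equal values, i.e. the
# count of elements strictly less than each value.
df["new_col"] = np.searchsorted(np.sort(values), values, side="left")
print(df)
```

This avoids comparing every value against every other value, so it should scale better on large columns.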

merge groupby results directly back to dataframe

Suppose I have the following data:
df = pd.DataFrame(data = [[1,1,10],[1,2,20],[1,3,50],[2,1,15],[2,2,20],[2,3,30],[3,1,40],[3,2,70]],columns=['id1','id2','x'])
id1 id2 x
0 1 1 10
1 1 2 20
2 1 3 50
3 2 1 15
4 2 2 20
5 2 3 30
6 3 1 40
7 3 2 70
The dataframe is sorted along the two ids. Suppose I'd like to know the value of x of the FIRST observation within each group of id1 observations. The result would be like
id1 id2 x first_x
1 1 10 10
1 2 20 10
1 3 50 10
2 1 15 15
2 2 20 15
2 3 30 15
3 1 40 40
3 2 70 40
How do I achieve this 'subscripting'? Ideally, the new column would be filled for each observation.
I thought along the lines of
df['first_x'] = df.groupby(['id1'])[0]

I think the simplest is transform with first:
df['first_x'] = df.groupby('id1')['x'].transform('first')
Or map by a Series created by drop_duplicates:
df['first_x'] = df['id1'].map(df.drop_duplicates('id1').set_index('id1')['x'])
print (df)
id1 id2 x first_x
0 1 1 10 10
1 1 2 20 10
2 1 3 50 10
3 2 1 15 15
4 2 2 20 15
5 2 3 30 15
6 3 1 40 40
7 3 2 70 40
The first is the shortest and fastest solution:
np.random.seed(123)
N = 1000000
L = list('abcde')
df = pd.DataFrame({'id1': np.random.randint(10000,size=N),
'x':np.random.randint(10000,size=N)})
df = df.sort_values('id1').reset_index(drop=True)
print (df)
In [179]: %timeit df.join(df.groupby(['id1'])['x'].first(), on='id1', how='left', lsuffix='', rsuffix='_first')
10 loops, best of 3: 125 ms per loop
In [180]: %%timeit
...: first_xs = df.groupby(['id1']).first().to_dict()['x']
...:
...: df['first_x'] = df['id1'].map(lambda id: first_xs[id])
...:
1 loop, best of 3: 524 ms per loop
In [181]: %timeit df['first_x'] = df.groupby('id1')['x'].transform('first')
10 loops, best of 3: 54.9 ms per loop
In [182]: %timeit df['first_x'] = df['id1'].map(df.drop_duplicates('id1').set_index('id1')['x'])
10 loops, best of 3: 142 ms per loop

Something like this?
df = pd.DataFrame(data = [[1,1,10],[1,2,20],[1,3,50],[2,1,15],[2,2,20],[2,3,30],[3,1,40],[3,2,70]],columns=['id1','id2','x'])
df = df.join(df.groupby(['id1'])['x'].first(), on='id1', how='left', lsuffix='', rsuffix='_first')

As you need to consider the entire dataframe when building values for each row, you need an intermediate step.
The following gets your first_x value using a group by, then uses that as a map to add a new column.
import pandas as pd
df = pd.DataFrame(data = [[1,1,10],[1,2,20],[1,3,50],[2,1,15],[2,2,20],[2,3,30],[3,1,40],[3,2,70]],columns=['id1','id2','x'])
first_xs = df.groupby(['id1']).first().to_dict()['x']
df['first_x'] = df['id1'].map(lambda id: first_xs[id])
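The intermediate dictionary and lambda are not strictly needed, since map also accepts a Series indexed by the lookup key. A sketch of that variant on the question's data (the name first_x_by_id is illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    data=[[1, 1, 10], [1, 2, 20], [1, 3, 50], [2, 1, 15],
          [2, 2, 20], [2, 3, 30], [3, 1, 40], [3, 2, 70]],
    columns=["id1", "id2", "x"],
)

# Series indexed by id1 holding the first x of each group.
first_x_by_id = df.groupby("id1")["x"].first()

# map looks up each row's id1 in that Series.
df["first_x"] = df["id1"].map(first_x_by_id)
print(df)
```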

How to duplicate Python dataframe one by one? [duplicate]

I have a pandas.DataFrame as follows:
df1 =
a b
0 1 2
1 3 4
I'd like to make this three times to become:
df2 =
a b
0 1 2
0 1 2
0 1 2
1 3 4
1 3 4
1 3 4
df2 is made from a loop, but it is not efficient.
How can I get df2 from df1 using a matrix way which is faster?

Build a one-dimensional indexer to slice both the values array and the index. You must take care of the index as well to get your desired results.
Use np.repeat on an np.arange to get the indexer, then construct a new dataframe using this indexer on both the values and the index:
r = np.arange(len(df)).repeat(3)
pd.DataFrame(df.values[r], df.index[r], df.columns)
a b
0 1 2
0 1 2
0 1 2
1 3 4
1 3 4
1 3 4

You can use np.repeat
df = pd.DataFrame(np.repeat(df.values,[3,3], axis = 0), columns = df.columns)
You get
a b
0 1 2
1 1 2
2 1 2
3 3 4
4 3 4
5 3 4
Time testing:
%timeit pd.DataFrame(np.repeat(df.values,[3,3], axis = 0))
1000 loops, best of 3: 235 µs per loop
%timeit pd.concat([df] * 3).sort_index()
best of 3: 1.26 ms per loop
NumPy is faster here, so no surprises there.
EDIT: I am not sure whether you want repeated indices, but in case you do:
pd.DataFrame(np.repeat(df.values,3, axis = 0), index = np.repeat(df.index, 3), columns = df.columns)

I do not know if it is more efficient than your loop, but it is easy enough to construct as:
Code:
pd.concat([df] * 3).sort_index()
Test Code:
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('ab'))
print(pd.concat([df] * 3).sort_index())
Results:
a b
0 1 2
0 1 2
0 1 2
1 3 4
1 3 4
1 3 4

You can use numpy.repeat with the scalar 3 as the repeats argument, and then pass the columns parameter to the DataFrame constructor:
df = pd.DataFrame(np.repeat(df.values, 3, axis=0), columns=df.columns)
print (df)
a b
0 1 2
1 1 2
2 1 2
3 3 4
4 3 4
5 3 4
If you really want a duplicated index (which can complicate some pandas functions, such as reindex, which fails on duplicate labels):
r = np.repeat(np.arange(len(df.index)), 3)
df = pd.DataFrame(df.values[r], df.index[r], df.columns)
print (df)
a b
0 1 2
0 1 2
0 1 2
1 3 4
1 3 4
1 3 4

Not the fastest (not the slowest either) but the shortest solution so far.
# Build an array of row positions and extract the rows to build the desired new df. This handles index and data all at once.
df.iloc[np.repeat(np.arange(len(df)), 3)]
Out[270]:
a b
0 1 2
0 1 2
0 1 2
1 3 4
1 3 4
1 3 4
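A label-based variant of the same idea is Index.repeat combined with loc. This is a sketch that assumes the index of df1 has no duplicate labels; unlike approaches that go through df.values, it does not convert the frame to a single NumPy array first.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 3], "b": [2, 4]})

# Repeat each index label three times, then select those labels in order.
df2 = df1.loc[df1.index.repeat(3)]
print(df2)
```

If you would rather have a fresh 0..n-1 index, chaining .reset_index(drop=True) onto the result should do it.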

