I have two dataframes A and B
A = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [11, 22, 33, 44, 55]})
B = pd.DataFrame({'a': [7, 2, 3, 4, 9], 'b': [123, 234, 456, 789, 1122]})
I want to merge B into A so that, for values of column 'a' common to both A and B, the rows from A are kept; only the rows of B whose values in column 'a' do not appear in A should be taken. The final dataframe should look like:
a     b
1    11
2    22
3    33
4    44
5    55
7   123
9  1122
If a is unique-valued in both A and B (some sort of unique ID for example), you can try with concat and drop_duplicates:
pd.concat([A,B]).drop_duplicates('a')
Output:
a b
0 1 11
1 2 22
2 3 33
3 4 44
4 5 55
0 7 123
4 9 1122
In the general case, use isin to check for existence of B['a'] in A['a']:
pd.concat([A, B[~B['a'].isin(A['a'])]])
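For reference, a minimal end-to-end sketch of the general case; ignore_index=True is an optional extra, assuming you prefer a clean 0..n-1 index over the preserved original indices:
# keep all of A, plus only the rows of B whose 'a' is not already in A
pd.concat([A, B[~B['a'].isin(A['a'])]], ignore_index=True)
This produces:
   a     b
0  1    11
1  2    22
2  3    33
3  4    44
4  5    55
5  7   123
6  9  1122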
I'm trying to create a new column in a df. I want the new column to equal the count of the number of rows for each unique 'mother_ID', which is a different column in the df.
This is what I'm currently doing. It makes the new column, but the new column is filled with NaNs.
df.columns = ['mother_ID', 'date_born', 'mother_mass_g', 'hatchling_masses_g']
df.to_numpy()
count = df.groupby('mother_ID').hatchling_masses_g.count()
df['count'] = count
When I print the new df, the 'count' column contains only NaN, although if I simply print(count) I get the correct counts for each mother_ID. Does anyone know what I'm doing wrong?
Use groupby transform('count'):
df['count'] = df.groupby('mother_ID')['hatchling_masses_g'].transform('count')
Notice the difference between groupby count and groupby transform with 'count'.
Sample Data:
import numpy as np
import pandas as pd
np.random.seed(5)
df = pd.DataFrame({
    'mother_ID': np.random.choice(['a', 'b'], 10),
    'hatchling_masses_g': np.random.randint(1, 100, 10)
})
mother_ID hatchling_masses_g
0 b 63
1 a 28
2 b 31
3 b 81
4 a 8
5 a 77
6 a 16
7 b 54
8 a 81
9 a 28
groupby.count
counts = df.groupby('mother_ID')['hatchling_masses_g'].count()
mother_ID
a 6
b 4
Name: hatchling_masses_g, dtype: int64
Notice how there are only 2 rows. When assigning back to the DataFrame, which has 10 rows, pandas doesn't know how to align the data, so it fills the gaps with NaNs indicating missing data:
df['count'] = counts
mother_ID hatchling_masses_g count
0 b 63 NaN
1 a 28 NaN
2 b 31 NaN
3 b 81 NaN
4 a 8 NaN
5 a 77 NaN
6 a 16 NaN
7 b 54 NaN
8 a 81 NaN
9 a 28 NaN
It tries to find 'a' and 'b' in the DataFrame's index and, since it cannot, fills the whole column with NaN values.
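If you do want to keep the two-row aggregate, one alternative sketch is to align it explicitly by mapping each row's group key to its count (map looks up counts by its mother_ID index):
# map each row's mother_ID to the aggregated count, restoring alignment
df['count'] = df['mother_ID'].map(counts)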
groupby.transform('count')
transform, on the other hand, will populate the entire group with the count:
counts = df.groupby('mother_ID')['hatchling_masses_g'].transform('count')
counts:
0 4
1 6
2 4
3 4
4 6
5 6
6 6
7 4
8 6
9 6
Name: hatchling_masses_g, dtype: int64
Notice 10 rows were created (one for every row in the DataFrame). This assigns back to the DataFrame nicely, since the indexes align:
df['count'] = counts
mother_ID hatchling_masses_g count
0 b 63 4
1 a 28 6
2 b 31 4
3 b 81 4
4 a 8 6
5 a 77 6
6 a 16 6
7 b 54 4
8 a 81 6
9 a 28 6
If needed, the counts can still be computed via groupby count and then joined back to the DataFrame on the group key:
counts = df.groupby('mother_ID')['hatchling_masses_g'].count().rename('count')
df = df.join(counts, on='mother_ID')
counts:
mother_ID
a 6
b 4
Name: count, dtype: int64
df:
mother_ID hatchling_masses_g count
0 b 63 4
1 a 28 6
2 b 31 4
3 b 81 4
4 a 8 6
5 a 77 6
6 a 16 6
7 b 54 4
8 a 81 6
9 a 28 6
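One side note on the above: 'count' skips NaN values, so if your real data has missing hatchling_masses_g entries and you want every row counted anyway (an assumption about your intent), 'size' is the variant to reach for:
# 'size' counts all rows per group, including rows with NaN masses
df['count'] = df.groupby('mother_ID')['hatchling_masses_g'].transform('size')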
I have a dataset with several columns of cumulative sums. For every row, I want to return the first column number that satisfies a condition.
Toy example:
df = pd.DataFrame(np.array(range(20)).reshape(4,5).T).cumsum(axis=1)
>>> df
0 1 2 3
0 0 5 15 30
1 1 7 18 34
2 2 9 21 38
3 3 11 24 42
4 4 13 27 46
For instance, I want to return the first column whose value is greater than 20.
Desired output:
3
3
2
2
2
Many thanks as always!
Try gt with idxmax:
df.gt(20).idxmax(1)
Out[66]:
0 3
1 3
2 2
3 2
4 2
dtype: object
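One caveat: if a row never exceeds 20, idxmax still returns the first column label, because the row is all False. A small guard, assuming you would rather get NaN for such rows:
mask = df.gt(20)
# blank out rows where no column satisfies the condition
mask.idxmax(axis=1).where(mask.any(axis=1))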
Not as short as @YOBEN_S's answer, but chaining index.get_loc and first_valid_index also works:
df[df>20].apply(lambda x: x.index.get_loc(x.first_valid_index()), axis=1)
0 3
1 3
2 2
3 2
4 2
dtype: int64
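If you need positional column numbers rather than labels, a numpy-based sketch (with the same all-False caveat as idxmax, since argmax also returns 0 for such rows):
# argmax returns the position of the first True in each row
df.gt(20).to_numpy().argmax(axis=1)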
I have a dataset with multiple columns and rows. The rows are supposed to be summed up based on the unique value in a column. I tried .groupby, but I want to retain the whole dataset and not just the summed-up columns based on one unique column. I further need to multiply these individual columns (values) by another column.
For example:
id A B C D E
11 2 1 2 4 100
11 2 2 1 1 100
12 1 3 2 2 200
13 3 1 1 4 190
14 NaN 1 2 2 300
I would like to sum up columns B, C & D based on the unique id and then, in a new column F, multiply the result by columns A and E. I do not want to sum up the values of columns A & E.
I would like the resultant dataframe to be something like this, where the calculation skips any NaN value and carries on with the rest:
id A B C D E F
11 2 3 3 5 100 9000
12 1 3 2 2 200 2400
13 3 1 1 4 190 2280
14 NaN 1 2 2 300
If the above is unachievable, then I would settle for something like the following, where the rows stay the same but the calculation is as stated above, based on the same id:
id A B C D E F
11 2 3 3 5 100 9000
11 2 2 1 1 100 9000
12 1 3 2 2 200 2400
13 3 1 1 4 190 2280
14 NaN 1 2 2 300
My logic earlier was to apply groupby on columns B, C, D and then multiply, but that is not working out for me. If the above dataframes are unachievable, then please let me know how I can perform this calculation and then merge/join the results with the original file using just the E column.
You must first sum columns B, C and D vertically for each common id, then take the horizontal product:
result = df.groupby('id').agg({'A': 'first', 'B': 'sum', 'C': 'sum', 'D': 'sum',
                               'E': 'first'})
result['F'] = result.fillna(1).astype('int64').agg('prod', axis=1)
It gives:
A B C D E F
id
11 2.0 3 3 5 100 9000
12 1.0 3 2 2 200 2400
13 3.0 1 1 4 190 2280
14 NaN 1 2 2 300 1200
Beware: id is the index here - use reset_index if you want it to be a normal column.
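A minimal sketch of that cleanup step:
# turn the 'id' index back into a regular column
result = result.reset_index()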
After a groupby, when using agg, if a dict of columns:functions is passed, the functions will be applied to the corresponding columns. Nevertheless, this syntax doesn't work with transform. Is there another way to apply several functions in transform?
Let's give an example:
import pandas as pd
df_test = pd.DataFrame([[1,2,3],[1,20,30],[2,30,50],[1,2,33],[2,4,50]],columns = ['a','b','c'])
Out[1]:
a b c
0 1 2 3
1 1 20 30
2 2 30 50
3 1 2 33
4 2 4 50
def my_fct1(series):
    return series.mean()
def my_fct2(series):
    return series.std()
df_test.groupby('a').agg({'b':my_fct1,'c':my_fct2})
Out[2]:
c b
a
1 16.522712 8
2 0.000000 17
The previous example shows how to apply different functions to different columns in agg, but if we want to transform the columns without aggregating them, agg can't be used anymore. Therefore:
df_test.groupby('a').transform({'b':np.cumsum,'c':np.cumprod})
Out[3]:
TypeError: unhashable type: 'dict'
How can we perform such an action with the following expected output:
a b c
0 1 2 3
1 1 22 90
2 2 30 50
3 1 24 2970
4 2 34 2500
You can still use a dict, but with a bit of a hack:
df_test.groupby('a').transform(lambda x: {'b': x.cumsum(), 'c': x.cumprod()}[x.name])
Out[427]:
b c
0 2 3
1 22 90
2 30 50
3 24 2970
4 34 2500
If you need to keep column a, you can do:
df_test.set_index('a')\
       .groupby('a')\
       .transform(lambda x: {'b': x.cumsum(), 'c': x.cumprod()}[x.name])\
       .reset_index()
Out[429]:
a b c
0 1 2 3
1 1 22 90
2 2 30 50
3 1 24 2970
4 2 34 2500
Another way is to use an if else to check column names:
df_test.set_index('a')\
       .groupby('a')\
       .transform(lambda x: x.cumsum() if x.name == 'b' else x.cumprod())\
       .reset_index()
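Both variants dispatch on x.name inside the lambda; a slightly more scalable sketch (func_map is just an illustrative name) keeps the column-to-function mapping in one place:
# map each column name to the transform it should receive
func_map = {'b': pd.Series.cumsum, 'c': pd.Series.cumprod}
df_test.set_index('a')\
       .groupby('a')\
       .transform(lambda x: func_map[x.name](x))\
       .reset_index()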
I think that for now (pandas 0.20.2), transform is not implemented with a dict of column names to functions the way agg is.
If the functions return Series of the same length:
df1 = df_test.set_index('a').groupby('a').agg({'b':np.cumsum,'c':np.cumprod}).reset_index()
print (df1)
a c b
0 1 3 2
1 1 90 22
2 2 50 30
3 1 2970 24
4 2 2500 34
But if the aggregation returns a different length, you need a join:
df2 = df_test[['a']].join(df_test.groupby('a').agg({'b':my_fct1,'c':my_fct2}), on='a')
print (df2)
a c b
0 1 16.522712 8
1 1 16.522712 8
2 2 0.000000 17
3 1 16.522712 8
4 2 0.000000 17
With updates to pandas, you can use the assign method along with transform to either append new columns or replace existing columns with new values:
grouper = df_test.groupby("a")
df_test.assign(b=grouper["b"].transform("cumsum"),
               c=grouper["c"].transform("cumprod"))
a b c
0 1 2 3
1 1 22 90
2 2 30 50
3 1 24 2970
4 2 34 2500
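Note that assign returns a new DataFrame rather than modifying df_test in place, so assign the result back if you want to keep it:
# persist the new columns
df_test = df_test.assign(b=grouper["b"].transform("cumsum"),
                         c=grouper["c"].transform("cumprod"))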
Let's say I create a DataFrame:
import pandas as pd
df = pd.DataFrame({"a": [1,2,3,13,15], "b": [4,5,6,6,6], "c": ["wish", "you","were", "here", "here"]})
Like so:
a b c
0 1 4 wish
1 2 5 you
2 3 6 were
3 13 6 here
4 15 6 here
... and then group and aggregate by a couple columns ...
gb = df.groupby(['b','c']).agg({"a": lambda x: x.nunique()})
Yielding the following result:
a
b c
4 wish 1
5 you 1
6 here 2
were 1
Is it possible to merge df with the newly aggregated table gb such that I create a new column in df, containing the corresponding values from gb? Like this:
a b c nc
0 1 4 wish 1
1 2 5 you 1
2 3 6 were 1
3 13 6 here 2
4 15 6 here 2
I tried doing the simplest thing:
df.merge(gb, on=['b','c'])
But this gives the error:
KeyError: 'b'
This makes sense because the grouped table has a MultiIndex and b is not a column. So my question is two-fold:
Can I transform the multi-index of the gb DataFrame back into columns (so that it has the b and c column)?
Can I merge df with gb on the column names?
Whenever you want to add an aggregated column from a groupby operation back to the df, you should use transform; this produces a Series with its index aligned to your original df:
In [4]:
df['nc'] = df.groupby(['b','c'])['a'].transform(pd.Series.nunique)
df
Out[4]:
a b c nc
0 1 4 wish 1
1 2 5 you 1
2 3 6 were 1
3 13 6 here 2
4 15 6 here 2
There is no need to reset the index or perform an additional merge.
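On reasonably recent pandas versions, the string alias works too and saves the explicit function reference:
# same result, using the built-in 'nunique' aggregation name
df['nc'] = df.groupby(['b','c'])['a'].transform('nunique')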
There's a simple way of doing this using reset_index().
df.merge(gb.reset_index(), on=['b','c'])
gives you
a_x b c a_y
0 1 4 wish 1
1 2 5 you 1
2 3 6 were 1
3 13 6 here 2
4 15 6 here 2
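Since both frames have an a column, merge suffixes them as a_x and a_y; to get the nc column from the question exactly, one option is to rename the aggregated column before merging:
# rename the aggregated 'a' so it doesn't collide with df's own 'a'
df.merge(gb.reset_index().rename(columns={'a': 'nc'}), on=['b','c'])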