dataframe new column based on groupby operations - python

import pandas
import numpy
df = pandas.DataFrame({'id_1': [1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2],
                       'id_2': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
                       'v_1': [2, 1, 1, 3, 2, 1, 2, 4, 1, 1, 2],
                       'v_2': [1, 1, 1, 1, 2, 2, 2, 1, 1, 2, 2],
                       'v_3': [3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3]})
In [4]: df
Out[4]:
id_1 id_2 v_1 v_2 v_3
0 1 1 2 1 3
1 2 1 1 1 3
2 1 1 1 1 3
3 1 1 3 1 3
4 1 1 2 2 4
5 1 2 1 2 4
6 1 2 2 2 4
7 2 2 4 1 3
8 2 2 1 1 3
9 2 2 1 2 3
10 2 2 2 2 3
sub = df[(df['id_1'] == 1) & (df['id_2'] == 1)].copy()
sub['v_4'] = numpy.where(sub['v_1'] == sub['v_2'].shift(), 'A',
                         numpy.where(sub['v_1'] == sub['v_3'].shift(), 'B', 'C'))
In [6]: sub
Out[6]:
id_1 id_2 v_1 v_2 v_3 v_4
0 1 1 2 1 3 C
2 1 1 1 1 3 A
3 1 1 3 1 3 B
4 1 1 2 2 4 C
I have a dataframe as defined above. I would like to categorize, for each group of (id_1, id_2), whether v_1 equals the previous row's v_2 or v_3.
I have worked out the operation on a sub-dataframe, as shown above. Now I would like a one-line way to combine that operation with the following groupby:
gbdf = df.groupby(by=['id_1', 'id_2'])
I have tried something like
gbdf['v_4'] = numpy.where(gbdf['v_1'] == gbdf['v_2'].shift(), 'A',
                          numpy.where(gbdf['v_1'] == gbdf['v_3'].shift(), 'B', 'C'))
and the error was
'DataFrameGroupBy' object does not support item assignment
I also tried
df['v_4'] = numpy.where(gbdf['v_1'] == gbdf['v_2'].shift(), 'A',
                        numpy.where(gbdf['v_1'] == gbdf['v_3'].shift(), 'B', 'C'))
but I believe the result was wrong: it does not align the groupby result with the original row order.
I am wondering whether there is an elegant way to achieve this.

This gets you a list of dataframes that match the content of the dataframe sub, but for all results of the .groupby():
import numpy
import pandas
source = pandas.DataFrame(
    {'id_1': [1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2],
     'id_2': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
     'v_1': [2, 1, 1, 3, 2, 1, 2, 4, 1, 1, 2],
     'v_2': [1, 1, 1, 1, 2, 2, 2, 1, 1, 2, 2],
     'v_3': [3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3]})
def add_v4(df):
    df['v_4'] = numpy.where(df['v_1'] == df['v_2'].shift(), 'A',
                            numpy.where(df['v_1'] == df['v_3'].shift(), 'B', 'C'))
    return df
dfs = [add_v4(pandas.DataFrame(group)) for _, group in source.groupby(by=['id_1', 'id_2'])]
print(dfs)
About this line:
dfs = [add_v4(pandas.DataFrame(group)) for _, group in source.groupby(by=['id_1', 'id_2'])]
It's a list comprehension that takes each slice produced by the groupby (named group here so it doesn't shadow the built-in slice), turns it into an actual new dataframe, and passes it to add_v4, which returns the modified dataframe to be added to the list.
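If a single assignment back onto the original dataframe is the goal, the per-group shift can also be done without materializing sub-frames, since DataFrameGroupBy.shift returns a result aligned with the original index — a sketch of that approach:

```python
import numpy
import pandas

df = pandas.DataFrame({'id_1': [1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2],
                       'id_2': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
                       'v_1': [2, 1, 1, 3, 2, 1, 2, 4, 1, 1, 2],
                       'v_2': [1, 1, 1, 1, 2, 2, 2, 1, 1, 2, 2],
                       'v_3': [3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3]})

# Shift v_2/v_3 within each (id_1, id_2) group; the shifted series keeps
# df's index, so the comparison and the assignment line up row for row.
g = df.groupby(['id_1', 'id_2'])
df['v_4'] = numpy.where(df['v_1'] == g['v_2'].shift(), 'A',
                        numpy.where(df['v_1'] == g['v_3'].shift(), 'B', 'C'))
print(df)
```

For the rows of group (1, 1) this reproduces the sub result shown in the question, while the other groups are filled in the same pass.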

Related

Dataframe Concatenation with Pandas

I'm quite new to pandas and I'm stuck with this dataframe concatenation.
Let's say I have 2 dataframes:
df_1 = pd.DataFrame({
    "A": [1, 1, 2, 2, 3, 4, 4],
    "B": [1, 2, 1, 2, 1, 1, 3],
    "C": ['a', 'b', 'c', 'd', 'e', 'f', 'g']
})
and
df_2 = pd.DataFrame({
    "A": [1, 3, 4],
    "D": [1, 'm', 7]
})
I would like to concatenate/merge the 2 dataframes on matching values of column 'A', so that the resulting dataframe is:
df_3 = pd.DataFrame({
    "A": [1, 1, 3, 4, 4],
    "B": [1, 2, 1, 1, 3],
    "C": ['a', 'b', 'e', 'f', 'g'],
    "D": [1, 1, 'm', 7, 7]
})
How can I do that?
Thanks in advance
Just do an inner merge:
df_1.merge(df_2, how="inner", on="A")
outputs
A B C D
0 1 1 a 1
1 1 2 b 1
2 3 1 e m
3 4 1 f 7
4 4 3 g 7
You can also do a left merge and then dropna:
df_3 = df_1.merge(df_2, on=['A'], how='left').dropna(axis=0)
Output:
A B C D
0 1 1 a 1
1 1 2 b 1
4 3 1 e m
5 4 1 f 7
6 4 3 g 7
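One detail worth noting about the left-merge variant: dropna keeps the row labels of the merged frame (0, 1, 4, 5, 6 above), so if a clean consecutive index is wanted, reset_index(drop=True) can be chained on — a small sketch reproducing the example:

```python
import pandas as pd

df_1 = pd.DataFrame({"A": [1, 1, 2, 2, 3, 4, 4],
                     "B": [1, 2, 1, 2, 1, 1, 3],
                     "C": ['a', 'b', 'c', 'd', 'e', 'f', 'g']})
df_2 = pd.DataFrame({"A": [1, 3, 4],
                     "D": [1, 'm', 7]})

# Left merge keeps every df_1 row; dropna removes the unmatched ones,
# and reset_index renumbers the surviving rows 0..n-1.
df_3 = (df_1.merge(df_2, on="A", how="left")
            .dropna(axis=0)
            .reset_index(drop=True))
print(df_3)
```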

Creating a derived column using pandas operations

I'm trying to create a column which contains a cumulative count of the number of entries, tid, grouped according to unique values of (rid, tid). The cumulative count should increment by the number of entries in the grouping, as shown in the df3 dataframe below, rather than one at a time.
import pandas as pd
df1 = pd.DataFrame({
    'rid': [1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 5],
    'tid': [1, 2, 2, 1, 1, 3, 1, 4, 5, 1, 1, 1, 3]})
rid tid
0 1 1
1 1 2
2 1 2
3 2 1
4 2 1
5 2 3
6 3 1
7 3 4
8 4 5
9 5 1
10 5 1
11 5 1
12 5 3
Giving after the required operation:
df3 = pd.DataFrame({
    'rid': [1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 5],
    'tid': [1, 2, 2, 1, 1, 3, 1, 4, 5, 1, 1, 1, 3],
    'groupentries': [1, 2, 2, 2, 2, 1, 1, 1, 1, 3, 3, 3, 1],
    'cumulativeentries': [1, 2, 2, 3, 3, 1, 4, 1, 1, 7, 7, 7, 2]})
rid tid groupentries cumulativeentries
0 1 1 1 1
1 1 2 2 2
2 1 2 2 2
3 2 1 2 3
4 2 1 2 3
5 2 3 1 1
6 3 1 1 4
7 3 4 1 1
8 4 5 1 1
9 5 1 3 7
10 5 1 3 7
11 5 1 3 7
12 5 3 1 2
The derived column that I'm after is cumulativeentries, although so far I've only figured out how to generate the intermediate groupentries counts using pandas:
df1.groupby(["rid", "tid"]).size()
Values in cumulativeentries are actually a kind of running count.
The task is to count occurrences of the current tid in the "source area" of the tid column:
from the beginning of the DataFrame,
up to (and including) the end of the current group.
To compute both required values for each group, I defined the following function:
def fn(grp):
    lastRow = grp.iloc[-1]  # last row of the current group
    lastId = lastRow.name   # index of this row
    tids = df1.truncate(after=lastId).tid
    return [grp.index.size, tids[tids == lastRow.tid].size]
To get the "source area" mentioned above I used the truncate function.
In my opinion it is a very intuitive solution, based on the notion of the
"source area".
The function returns a list containing both required values:
the size of the current group,
how many tids equal to the current tid are in the
truncated tid column.
To apply this function, run:
df2 = df1.groupby(['rid', 'tid']).apply(fn).apply(pd.Series)\
.rename(columns={0: 'groupentries', 1: 'cumulativeentries'})
Details:
apply(fn) generates a Series containing 2-element lists.
apply(pd.Series) converts it to a DataFrame (with default column names).
rename sets the target column names.
And the last thing to do is to join this table to df1:
df1.join(df2, on=['rid', 'tid'])
For the first column, use GroupBy.transform with DataFrameGroupBy.size. For the second, use a custom function that takes the column values from the start of the frame up to the group's last index, compares them with the group's last value, and counts the matches with sum:
f = lambda x: (df1['tid'].iloc[:x.index[-1]+1] == x.iat[-1]).sum()
df1['groupentries'] = df1.groupby(["rid", "tid"])['rid'].transform('size')
df1['cumulativeentries'] = df1.groupby(["rid", "tid"])['tid'].transform(f)
print (df1)
rid tid groupentries cumulativeentries
0 1 1 1 1
1 1 2 2 2
2 1 2 2 2
3 2 1 2 3
4 2 1 2 3
5 2 3 1 1
6 3 1 1 4
7 3 4 1 1
8 4 5 1 1
9 5 1 3 7
10 5 1 3 7
11 5 1 3 7
12 5 3 1 2
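For reference, the transform-based answer assembled as one runnable snippet (same data as above):

```python
import pandas as pd

df1 = pd.DataFrame({
    'rid': [1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 5],
    'tid': [1, 2, 2, 1, 1, 3, 1, 4, 5, 1, 1, 1, 3]})

# For each group, look at the tid column from the start of the frame up to
# the group's last row, and count how many values equal the group's tid.
f = lambda x: (df1['tid'].iloc[:x.index[-1] + 1] == x.iat[-1]).sum()

df1['groupentries'] = df1.groupby(['rid', 'tid'])['rid'].transform('size')
df1['cumulativeentries'] = df1.groupby(['rid', 'tid'])['tid'].transform(f)
print(df1)
```

Note that the lambda relies on the default RangeIndex, where row labels equal row positions; with a different index the slice would have to be taken by label instead.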

Create new column on grouped data frame

I want to create a new column that is calculated by groups using multiple columns from the current data frame. Basically something like this in R (tidyverse):
require(tidyverse)
data <- data_frame(
  a = c(1, 2, 1, 2, 3, 1, 2),
  b = c(1, 1, 1, 1, 1, 1, 1),
  c = c(1, 0, 1, 1, 0, 0, 1),
)
data %>%
  group_by(a) %>%
  mutate(d = cumsum(b) * c)
In pandas I think I should use groupby and apply to create the new column and then assign it to the original data frame. This is what I've tried so far:
import numpy as np
import pandas as pd
def create_new_column(data):
    return np.cumsum(data['b']) * data['c']
data = pd.DataFrame({
    'a': [1, 2, 1, 2, 3, 1, 2],
    'b': [1, 1, 1, 1, 1, 1, 1],
    'c': [1, 0, 1, 1, 0, 0, 1],
})
# assign - throws error
data['d'] = data.groupby('a').apply(create_new_column)
# assign without index - incorrect order in output
data['d'] = data.groupby('a').apply(create_new_column).values
# assign to sorted data frame
data_sorted = data.sort_values('a')
data_sorted['d'] = data_sorted.groupby('a').apply(create_new_column).values
What is the preferred way (ideally without sorting the data) to achieve this?
Add the parameter group_keys=False to avoid a MultiIndex, so the result can be assigned back as a new column:
data['d'] = data.groupby('a', group_keys=False).apply(create_new_column)
An alternative is to remove the first index level:
data['d'] = data.groupby('a').apply(create_new_column).reset_index(level=0, drop=True)
print (data)
a b c d
0 1 1 1 1
1 2 1 0 0
2 1 1 1 2
3 2 1 1 2
4 3 1 0 0
5 1 1 0 0
6 2 1 1 3
Detail:
print (data.groupby('a').apply(create_new_column))
a
1 0 1
2 2
5 0
2 1 0
3 2
6 3
3 4 0
dtype: int64
print (data.groupby('a', group_keys=False).apply(create_new_column))
0 1
2 2
5 0
1 0
3 2
6 3
4 0
dtype: int64
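Since cumsum is one of the built-in groupby operations, this particular calculation can also be written without apply at all: SeriesGroupBy.cumsum returns a result aligned with the original index, so it can be multiplied by c directly — a sketch:

```python
import pandas as pd

data = pd.DataFrame({
    'a': [1, 2, 1, 2, 3, 1, 2],
    'b': [1, 1, 1, 1, 1, 1, 1],
    'c': [1, 0, 1, 1, 0, 0, 1],
})

# Per-group running sum of b, already in the original row order, scaled by c.
data['d'] = data.groupby('a')['b'].cumsum() * data['c']
print(data)
```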
Now you can also implement it in Python with datar, exactly the way you did it in R:
>>> from datar.all import c, f, tibble, cumsum
>>>
>>> data = tibble(
... a = c(1, 2, 1, 2, 3, 1, 2),
... b = c(1, 1, 1, 1, 1, 1, 1),
... c = c(1, 0, 1, 1, 0, 0, 1),
... )
>>>
>>> (data >>
... group_by(f.a) >>
... mutate(d=cumsum(f.b) * f.c))
a b c d
0 1 1 1 1
1 2 1 0 0
2 1 1 1 2
3 2 1 1 2
4 3 1 0 0
5 1 1 0 0
6 2 1 1 3
[Groups: ['a'] (n=3)]
I am the author of the package. Feel free to submit issues if you have any questions.

Create a column in a dataframe that is a string of characters summarizing data in other columns

I have a dataframe like this where the columns are the scores of some metrics:
A B C D
4 3 3 1
2 5 2 2
3 5 2 4
I want to create a new column to summarize which metrics each row scored over a set threshold in, using the column name as a string. So if the threshold was A > 2, B > 3, C > 1, D > 3, I would want the new column to look like this:
A B C D NewCol
4 3 3 1 AC
2 5 2 2 BC
3 5 2 4 ABCD
I tried using a series of np.where:
df['NewCol'] = np.where(df['A'] > 2, 'A', '')
df['NewCol'] = np.where(df['B'] > 3, 'B', '')
etc.
but realized each assignment overwrites the previous one, so the result only reflected the last metric whenever all four conditions weren't met, like so:
A B C D NewCol
4 3 3 1 C
2 5 2 2 C
3 5 2 4 ABCD
I am pretty sure there is an easier and correct way to do this.
You could do:
import pandas as pd
data = [[4, 3, 3, 1],
        [2, 5, 2, 2],
        [3, 5, 2, 4]]
df = pd.DataFrame(data=data, columns=['A', 'B', 'C', 'D'])
th = {'A': 2, 'B': 3, 'C': 1, 'D': 3}
df['result'] = [''.join(k for k in df.columns if record[k] > th[k]) for record in df.to_dict('records')]
print(df)
Output
A B C D result
0 4 3 3 1 AC
1 2 5 2 2 BC
2 3 5 2 4 ABCD
Using dot
s = pd.Series([2, 3, 1, 3], index=df.columns)
df.gt(s, axis=1).dot(df.columns)
Out[179]:
0 AC
1 BC
2 ABCD
dtype: object
#df['New']=df.gt(s,1).dot(df.columns)
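The dot trick works because multiplying a boolean by a string in Python repeats the string 0 or 1 times, so the matrix product of the comparison frame with the column labels concatenates exactly the labels whose condition holds. A self-contained version:

```python
import pandas as pd

df = pd.DataFrame([[4, 3, 3, 1], [2, 5, 2, 2], [3, 5, 2, 4]],
                  columns=['A', 'B', 'C', 'D'])
thresholds = pd.Series([2, 3, 1, 3], index=df.columns)

# Boolean frame of "score exceeds threshold", matrix-multiplied with the
# column labels: True * 'A' -> 'A', False * 'A' -> ''.
df['New'] = df.gt(thresholds, axis=1).dot(df.columns)
print(df)
```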
Another option that operates in an array fashion. It would be interesting to compare performance.
import pandas as pd
import numpy as np
# Data to test.
data = pd.DataFrame(
    [
        [4, 3, 3, 1],
        [2, 5, 2, 2],
        [3, 5, 2, 4]
    ],
    columns=['A', 'B', 'C', 'D']
)
# Series to hold the thresholds.
thresholds = pd.Series([2, 3, 1, 3], index = ['A', 'B', 'C', 'D'])
# Subtract the series from the data, broadcasting, and then use sum to concatenate the strings.
data['result'] = np.where(data - thresholds > 0, data.columns, '').sum(axis = 1)
print(data)
Gives:
A B C D result
0 4 3 3 1 AC
1 2 5 2 2 BC
2 3 5 2 4 ABCD

Conditional multi column matching (reviewed with new example)

I tried to rework my question to match the quality criteria, and spent more time trying to achieve the result on my own.
Given are two DataFrames
from pandas import DataFrame

a = DataFrame({"id": ["id1"] * 3 + ["id2"] * 3 + ["id3"] * 3,
               "left": [6, 2, 5, 2, 1, 4, 5, 2, 4],
               "right": [1, 3, 4, 6, 5, 3, 6, 3, 2]
               })
b = DataFrame({"id": ["id1"] * 6 + ["id2"] * 6 + ["id3"] * 6,
               "left_and_right": list(range(1, 7)) * 3,  # list(...) needed in Python 3
               "boolen": [0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1]
               })
The expected result is
result = DataFrame({"id": ["id1"] * 3 + ["id2"] * 3 + ["id3"] * 3,
                    "left": [6, 2, 5, 2, 1, 4, 5, 2, 4],
                    "right": [1, 3, 4, 6, 5, 3, 6, 3, 2],
                    "NEW": [0, 1, 1, 0, 1, 1, 1, 1, 0]
                    })
So I want to check, for each row of DataFrame b, whether there is a row in DataFrame a where a.id == b.id, and then whether b.left_and_right equals a.left or a.right. If such a row is found and b.boolen is True/1 for either the a.left or the a.right value, then a.NEW in that row should also be True/1.
I hope the example illustrates it better than my words.
To sum it up: for each row where id matches across both DataFrames, if b.boolen is True/1 for a b.left_and_right value that appears in a.left or a.right, the new value in a.NEW should also be True/1.
I have tried using the pd.match() and pd.merge() functions in combination with the & and | operators but could not achieve the wanted result.
Some time ago I asked a very similar question about a similar problem in R (the data was organized in a slightly different way, so it was a bit different), but now I'm failing with the same approach in Python.
Related question: Conditional matching of two lists with multi-column data.frames
Thanks
Just use boolean masks with & (and) and | (or):
In [11]: (a.A == b.A) & ((a.B == b.E) | (a.C == b.E)) # they all satisfy this requirement!
Out[11]:
0 True
1 True
2 True
3 True
dtype: bool
In [12]: b.D[(a.A == b.A) & ((a.B == b.E) | (a.C == b.E))]
Out[12]:
0 0
1 1
2 0
3 0
Name: D, dtype: int64
In [13]: a['NEW'] = b.D[(a.A == b.A) & ((a.B == b.E) | (a.C == b.E))]
In [14]: a
Out[14]:
A B C NEW
0 foo 1 4 0
1 goo 2 3 1
2 doo 3 1 0
3 boo 4 2 0
Update with the slightly different question:
In [21]: merged = pd.merge(a, b, on='id')
In [22]: matching = merged[(merged.left == merged.left_and_right) | (merged.right == merged.left_and_right)]
In [23]: (matching.groupby(['id', 'left', 'right'])['boolen'].sum()).reset_index()
Out[23]:
id left right boolen
0 id1 2 3 1
1 id1 5 4 1
2 id1 6 1 0
3 id2 1 5 2
4 id2 2 6 0
5 id2 4 3 1
6 id3 2 3 1
7 id3 4 2 0
8 id3 5 6 1
Note there is a 2 here, so you probably want to treat anything > 0 as True:
In [24]: (matching.groupby(['id', 'left', 'right'])['boolen'].sum() > 0).reset_index()
Out[24]:
id left right boolen
0 id1 2 3 True
1 id1 5 4 True
2 id1 6 1 False
3 id2 1 5 True
4 id2 2 6 False
5 id2 4 3 True
6 id3 2 3 True
7 id3 4 2 False
8 id3 5 6 True
You may want to rename the boolen column to NEW.
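To line the boolean summary back up with a's rows (and so obtain the NEW column from the expected result), the summed and thresholded frame can be merged back on the three key columns — a sketch, using the rename mentioned above:

```python
import pandas as pd

a = pd.DataFrame({"id": ["id1"] * 3 + ["id2"] * 3 + ["id3"] * 3,
                  "left": [6, 2, 5, 2, 1, 4, 5, 2, 4],
                  "right": [1, 3, 4, 6, 5, 3, 6, 3, 2]})
b = pd.DataFrame({"id": ["id1"] * 6 + ["id2"] * 6 + ["id3"] * 6,
                  "left_and_right": list(range(1, 7)) * 3,
                  "boolen": [0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1]})

merged = pd.merge(a, b, on='id')
matching = merged[(merged.left == merged.left_and_right) |
                  (merged.right == merged.left_and_right)]

# Collapse to one row per (id, left, right); > 0 means at least one
# matching b row had boolen set.
new = (matching.groupby(['id', 'left', 'right'])['boolen'].sum() > 0).astype(int)
result = a.merge(new.rename('NEW').reset_index(), on=['id', 'left', 'right'])
print(result)
```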
