Conditional multi-column matching (revised with a new example) - Python

I have reworked my question to meet the quality criteria and spent more time trying to achieve the result on my own.
Given are two DataFrames:
from pandas import DataFrame

a = DataFrame({"id": ["id1"] * 3 + ["id2"] * 3 + ["id3"] * 3,
               "left": [6, 2, 5, 2, 1, 4, 5, 2, 4],
               "right": [1, 3, 4, 6, 5, 3, 6, 3, 2]
               })
b = DataFrame({"id": ["id1"] * 6 + ["id2"] * 6 + ["id3"] * 6,
               "left_and_right": list(range(1, 7)) * 3,
               "boolen": [0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1]
               })
The expected result is
result = DataFrame({"id": ["id1"] * 3 + ["id2"] * 3 + ["id3"] * 3,
                    "left": [6, 2, 5, 2, 1, 4, 5, 2, 4],
                    "right": [1, 3, 4, 6, 5, 3, 6, 3, 2],
                    "NEW": [0, 1, 1, 0, 1, 1, 1, 1, 0]
                    })
So I want to check, for each row of DataFrame b, whether there is a row in DataFrame a where a.id == b.id AND where b.left_and_right equals (==) a.left OR a.right. If such a row is found and b.boolen is True/1 for either the value of a.left or a.right, the value of a.NEW in that row should also be True/1.
I hope the example illustrates it better than my words.
To sum it up: for each row where the id is the same in both DataFrames, I want to check whether b.boolen is True/1 for a value of b.left_and_right, and if that value appears in a.left or a.right, the new value in a.NEW should also be True/1.
I have tried using the pd.match() and pd.merge() functions in combination with the & and | operators, but could not achieve the desired result.
Some time ago I asked a very similar question about a similar problem in R (the data was organized in a slightly different way, so it was a bit different), but now I fail to apply the same approach in Python.
Related question: Conditional matching of two lists with multi-column data.frames
Thanks

Just use boolean masks with & (and) and | (or):
In [11]: (a.A == b.A) & ((a.B == b.E) | (a.C == b.E)) # they all satisfy this requirement!
Out[11]:
0 True
1 True
2 True
3 True
dtype: bool
In [12]: b.D[(a.A == b.A) & ((a.B == b.E) | (a.C == b.E))]
Out[12]:
0 0
1 1
2 0
3 0
Name: D, dtype: int64
In [13]: a['NEW'] = b.D[(a.A == b.A) & ((a.B == b.E) | (a.C == b.E))]
In [14]: a
Out[14]:
A B C NEW
0 foo 1 4 0
1 goo 2 3 1
2 doo 3 1 0
3 boo 4 2 0
Update for the revised question:
In [21]: merged = pd.merge(a, b, on='id')
In [22]: matching = merged[(merged.left == merged.left_and_right) | (merged.right == merged.left_and_right)]
In [23]: (matching.groupby(['id', 'left', 'right'])['boolen'].sum()).reset_index()
Out[23]:
id left right boolen
0 id1 2 3 1
1 id1 5 4 1
2 id1 6 1 0
3 id2 1 5 2
4 id2 2 6 0
5 id2 4 3 1
6 id3 2 3 1
7 id3 4 2 0
8 id3 5 6 1
Note that there is a 2 here, so you probably only care whether the sum is > 0.
In [24]: (matching.groupby(['id', 'left', 'right'])['boolen'].sum() > 0).reset_index()
Out[24]:
id left right boolen
0 id1 2 3 True
1 id1 5 4 True
2 id1 6 1 False
3 id2 1 5 True
4 id2 2 6 False
5 id2 4 3 True
6 id3 2 3 True
7 id3 4 2 False
8 id3 5 6 True
You may want to rename the boolen column to NEW.
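For example, to attach the flag back to a as the NEW column from the expected result (a minimal sketch building on matching from above; the merge keys ['id', 'left', 'right'] and the final fillna are assumptions about how the rows should be aligned, not part of the original answer):

summary = (matching.groupby(['id', 'left', 'right'])['boolen'].sum() > 0).reset_index()
summary = summary.rename(columns={'boolen': 'NEW'})

# Align the per-(id, left, right) flag with the rows of a; unmatched rows become 0.
result = a.merge(summary, on=['id', 'left', 'right'], how='left')
result['NEW'] = result['NEW'].fillna(False).astype(int)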

Related

dataframe new column based on groupby operations

import pandas
import numpy
df = pandas.DataFrame({'id_1': [1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2],
                       'id_2': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
                       'v_1': [2, 1, 1, 3, 2, 1, 2, 4, 1, 1, 2],
                       'v_2': [1, 1, 1, 1, 2, 2, 2, 1, 1, 2, 2],
                       'v_3': [3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3]})
In [4]: df
Out[4]:
id_1 id_2 v_1 v_2 v_3
0 1 1 2 1 3
1 2 1 1 1 3
2 1 1 1 1 3
3 1 1 3 1 3
4 1 1 2 2 4
5 1 2 1 2 4
6 1 2 2 2 4
7 2 2 4 1 3
8 2 2 1 1 3
9 2 2 1 2 3
10 2 2 2 2 3
sub = df[(df['id_1'] == 1) & (df['id_2'] == 1)].copy()
sub['v_4'] = numpy.where(sub['v_1'] == sub['v_2'].shift(), 'A',
                         numpy.where(sub['v_1'] == sub['v_3'].shift(), 'B', 'C'))
In [6]: sub
Out[6]:
id_1 id_2 v_1 v_2 v_3 v_4
0 1 1 2 1 3 C
2 1 1 1 1 3 A
3 1 1 3 1 3 B
4 1 1 2 2 4 C
I have a dataframe as defined above. I would like to perform an operation that categorizes whether v_1 equals the previous row's v_2 or v_3 within each group of (id_1, id_2).
I have performed the operation on a sub-DataFrame, and I would like a one-liner that combines the following groupby with the operation I applied to the sub-DataFrame.
gbdf = df.groupby(by=['id_1', 'id_2'])
I have tried something like
gbdf['v_4'] = numpy.where(gbdf['v_1'] == gbdf['v_2'].shift(), 'A', \
numpy.where(gbdf['v_1'] == gbdf['v_3'].shift(), 'B', 'C'))
and the error was
'DataFrameGroupBy' object does not support item assignment
I also tried
df['v_4'] = numpy.where(gbdf['v_1'] == gbdf['v_2'].shift(), 'A', \
numpy.where(gbdf['v_1'] == gbdf['v_3'].shift(), 'B', 'C'))
but I believe the result was wrong; it does not align the groupby result with the original row order.
I am wondering whether there is an elegant way to achieve this.
This gets you a list of dataframes that match the content of the dataframe sub, but for all results of the .groupby():
import numpy
import pandas
source = pandas.DataFrame(
    {'id_1': [1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2],
     'id_2': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
     'v_1': [2, 1, 1, 3, 2, 1, 2, 4, 1, 1, 2],
     'v_2': [1, 1, 1, 1, 2, 2, 2, 1, 1, 2, 2],
     'v_3': [3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3]})

def add_v4(df):
    df['v_4'] = numpy.where(df['v_1'] == df['v_2'].shift(), 'A',
                numpy.where(df['v_1'] == df['v_3'].shift(), 'B', 'C'))
    return df

dfs = [add_v4(pandas.DataFrame(slice)) for _, slice in source.groupby(by=['id_1', 'id_2'])]

print(dfs)
About this line:
dfs = [add_v4(pandas.DataFrame(slice)) for _, slice in source.groupby(by=['id_1', 'id_2'])]
It's a list comprehension that gets all the slices from the groupby and turns them into actual new dataframes before passing them to add_v4, which returns the modified dataframe to be added to the list.
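If you would rather get a single DataFrame back instead of a list of per-group frames, a minimal sketch (assuming the same source DataFrame and the add_v4 logic above) relies on the fact that a grouped shift() stays aligned with the original index:

# Shift v_2 and v_3 within each (id_1, id_2) group; the result keeps the original row index.
shifted = source.groupby(['id_1', 'id_2'])[['v_2', 'v_3']].shift()

# Compare v_1 with the previous row's v_2/v_3 from the same group and label the rows.
source['v_4'] = numpy.where(source['v_1'] == shifted['v_2'], 'A',
                numpy.where(source['v_1'] == shifted['v_3'], 'B', 'C'))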

Creating a derived column using pandas operations

I'm trying to create a column containing a cumulative sum of the number of tid entries, grouped by unique values of (rid, tid). The cumulative sum should increment by the number of entries in the group, as shown in the df3 dataframe below, rather than one at a time.
import pandas as pd
df1 = pd.DataFrame({
    'rid': [1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 5],
    'tid': [1, 2, 2, 1, 1, 3, 1, 4, 5, 1, 1, 1, 3]})
rid tid
0 1 1
1 1 2
2 1 2
3 2 1
4 2 1
5 2 3
6 3 1
7 3 4
8 4 5
9 5 1
10 5 1
11 5 1
12 5 3
Giving after the required operation:
df3 = pd.DataFrame({
    'rid': [1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 5],
    'tid': [1, 2, 2, 1, 1, 3, 1, 4, 5, 1, 1, 1, 3],
    'groupentries': [1, 2, 2, 2, 2, 1, 1, 1, 1, 3, 3, 3, 1],
    'cumulativeentries': [1, 2, 2, 3, 3, 1, 4, 1, 1, 7, 7, 7, 2]})
rid tid groupentries cumulativeentries
0 1 1 1 1
1 1 2 2 2
2 1 2 2 2
3 2 1 2 3
4 2 1 2 3
5 2 3 1 1
6 3 1 1 4
7 3 4 1 1
8 4 5 1 1
9 5 1 3 7
10 5 1 3 7
11 5 1 3 7
12 5 3 1 2
The derived column that I'm after is the cumulativeentries column, although I've only figured out how to generate the intermediate groupentries column using pandas:
df1.groupby(["rid", "tid"]).size()
Values in cumulativeentries are actually a kind of running count.
The task is to count occurrences of the current tid in the "source area" of the tid column:
from the beginning of the DataFrame,
up to (including) the end of the current group.
To compute both required values for each group, I defined the following function:
def fn(grp):
    lastRow = grp.iloc[-1]  # last row of the current group
    lastId = lastRow.name   # index of this row
    tids = df1.truncate(after=lastId).tid
    return [grp.index.size, tids[tids == lastRow.tid].size]
To get the "source area" mentioned above, I used the truncate function.
In my opinion it is a very intuitive solution, based on the notion of the
"source area".
The function returns a list containing both required values:
the size of the current group,
how many tids equal to the current tid are in the
truncated tid column.
To apply this function, run:
df2 = df1.groupby(['rid', 'tid']).apply(fn).apply(pd.Series)\
.rename(columns={0: 'groupentries', 1: 'cumulativeentries'})
Details:
apply(fn) generates a Series containing 2-element lists.
apply(pd.Series) converts it to a DataFrame (with default column names).
rename sets the target column names.
And the last thing to do is to join this table to df1:
df1.join(df2, on=['rid', 'tid'])
For the first column, use GroupBy.transform with DataFrameGroupBy.size. For the second, use a custom function that takes all values of the tid column up to the group's last index, compares them with the group's last value, and counts the matches with sum:
f = lambda x: (df1['tid'].iloc[:x.index[-1]+1] == x.iat[-1]).sum()
df1['groupentries'] = df1.groupby(["rid", "tid"])['rid'].transform('size')
df1['cumulativeentries'] = df1.groupby(["rid", "tid"])['tid'].transform(f)
print (df1)
rid tid groupentries cumulativeentries
0 1 1 1 1
1 1 2 2 2
2 1 2 2 2
3 2 1 2 3
4 2 1 2 3
5 2 3 1 1
6 3 1 1 4
7 3 4 1 1
8 4 5 1 1
9 5 1 3 7
10 5 1 3 7
11 5 1 3 7
12 5 3 1 2
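If you want to avoid the per-group Python function entirely, a vectorized sketch (my own alternative, not part of either answer above) counts each tid occurrence with cumcount and then takes the maximum of that running count inside every (rid, tid) group:

# Running, 1-based count of each tid value over the whole DataFrame.
running = df1.groupby('tid').cumcount() + 1

# Within each (rid, tid) group, the value at the group's last row is the group maximum,
# which is exactly the count "from the beginning up to the end of the current group".
df1['cumulativeentries'] = running.groupby([df1['rid'], df1['tid']]).transform('max')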

Create new column on grouped data frame

I want to create new column that is calculated by groups using multiple columns from current data frame. Basically something like this in R (tidyverse):
require(tidyverse)
data <- data_frame(
  a = c(1, 2, 1, 2, 3, 1, 2),
  b = c(1, 1, 1, 1, 1, 1, 1),
  c = c(1, 0, 1, 1, 0, 0, 1),
)

data %>%
  group_by(a) %>%
  mutate(d = cumsum(b) * c)
In pandas, I think I should use groupby and apply to create the new column and then assign it to the original data frame. This is what I've tried so far:
import numpy as np
import pandas as pd
def create_new_column(data):
    return np.cumsum(data['b']) * data['c']

data = pd.DataFrame({
    'a': [1, 2, 1, 2, 3, 1, 2],
    'b': [1, 1, 1, 1, 1, 1, 1],
    'c': [1, 0, 1, 1, 0, 0, 1],
})
# assign - throws error
data['d'] = data.groupby('a').apply(create_new_column)
# assign without index - incorrect order in output
data['d'] = data.groupby('a').apply(create_new_column).values
# assign to sorted data frame
data_sorted = data.sort_values('a')
data_sorted['d'] = data_sorted.groupby('a').apply(create_new_column).values
What is preferred way (ideally without sorting the data) to achieve this?
Add the parameter group_keys=False to avoid the MultiIndex, so the result can be assigned back as a new column:
data['d'] = data.groupby('a', group_keys=False).apply(create_new_column)
An alternative is to remove the first index level:
data['d'] = data.groupby('a').apply(create_new_column).reset_index(level=0, drop=True)
print (data)
a b c d
0 1 1 1 1
1 2 1 0 0
2 1 1 1 2
3 2 1 1 2
4 3 1 0 0
5 1 1 0 0
6 2 1 1 3
Detail:
print (data.groupby('a').apply(create_new_column))
a
1 0 1
2 2
5 0
2 1 0
3 2
6 3
3 4 0
dtype: int64
print (data.groupby('a', group_keys=False).apply(create_new_column))
0 1
2 2
5 0
1 0
3 2
6 3
4 0
dtype: int64
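For this particular calculation you can also skip apply altogether: a grouped cumsum stays aligned with the original index, so it can be assigned directly (a minimal sketch with the same data):

# Cumulative sum of b within each group of a, multiplied elementwise by c.
data['d'] = data.groupby('a')['b'].cumsum() * data['c']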
Now you can also implement it in Python with datar, exactly the way you did it in R:
>>> from datar.all import c, f, tibble, group_by, mutate, cumsum
>>>
>>> data = tibble(
... a = c(1, 2, 1, 2, 3, 1, 2),
... b = c(1, 1, 1, 1, 1, 1, 1),
... c = c(1, 0, 1, 1, 0, 0, 1),
... )
>>>
>>> (data >>
... group_by(f.a) >>
... mutate(d=cumsum(f.b) * f.c))
a b c d
0 1 1 1 1
1 2 1 0 0
2 1 1 1 2
3 2 1 1 2
4 3 1 0 0
5 1 1 0 0
6 2 1 1 3
[Groups: ['a'] (n=3)]
I am the author of the package. Feel free to submit issues if you have any questions.

Pandas create a unique id for each row based on a condition

I have a dataset where one of the columns is as below. I'd like to create a new column based on the following condition.
For values in column_name: if a 1 is present, create a new id; if a 0 is present, also create a new id. But if 1 is repeated across more than one consecutive row, the id should be the same for all of those rows. The sample output can be seen below.
column_name
1
0
0
1
1
1
1
0
0
1
column_name -- ID
1 -- 1
0 -- 2
0 -- 3
1 -- 4
1 -- 4
1 -- 4
1 -- 4
0 -- 5
0 -- 6
1 -- 7
Say your Series is
s = pd.Series([1, 0, 0, 1, 1, 1, 1, 0, 0, 1])
Then you can use:
>>> ((s != 1) | (s.shift(1) != 1)).cumsum()
0 1
1 2
2 3
3 4
4 4
5 4
6 4
7 5
8 6
9 7
dtype: int64
This checks that either the current entry is not 1, or that the previous entry is not 1, and then performs a cumulative sum on the result.
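To attach this as the ID column of a DataFrame, a small sketch (assuming the column is called column_name as in the question):

import pandas as pd

df = pd.DataFrame({'column_name': [1, 0, 0, 1, 1, 1, 1, 0, 0, 1]})

# New id whenever the current value is not 1 or the previous value is not 1.
s = df['column_name']
df['ID'] = ((s != 1) | (s.shift(1) != 1)).cumsum()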
Essentially, this leverages the fact that a 1 preceded by another 1 should be treated as part of the same group, while every 0 calls for an increment. One of four things will happen:
1) 0 with a preceding 0 : Increment by 1
2) 0 with a preceding 1 : Increment by 1
3) 1 with a preceding 1 : Increment by 0
4) 1 with a preceding 0: Increment by 1
((df['column_name'] + df['column_name'].shift(1))  # Series with values 0, 1, or 2 (first entry is NaN)
    .fillna(0)       # fill the first entry with 0
    .isin([0, 1])    # True for cases 1, 2, and 4 described above, else False (case 3)
    .astype('int')   # cast the booleans to integers
    .cumsum())
Output:
0 1
1 2
2 3
3 4
4 4
5 4
6 4
7 5
8 6
9 7
At this stage I would just use a regular Python for loop:
column_name = pd.Series([1, 0, 0, 1, 1, 1, 1, 0, 0, 1])
ID = [1]
for i in range(1, len(column_name)):
    ID.append(ID[-1] + ((column_name[i] + column_name[i-1]) < 2))
print(ID)
>>> [1, 2, 3, 4, 4, 4, 4, 5, 6, 7]
And then you can assign ID as a column in your dataframe, for example as in the sketch below.
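A minimal sketch (assuming the dataframe is built from the same column_name Series above):

df = pd.DataFrame({'column_name': column_name})
df['ID'] = ID  # attach the computed ids as a new column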

Pandas: group columns of duplicate rows into column of lists

I have a Pandas dataframe that looks something like this:
>>> df
m event
0 3 1
1 1 1
2 1 2
3 1 2
4 2 1
5 2 0
6 3 1
7 2 2
8 3 2
9 3 1
I want to group the values of the event column into lists based on the m column so that I would get this:
>>> df
m events
0 3 [1, 1, 2, 1]
1 1 [1, 2, 2]
2 2 [1, 0, 2]
There should be one row per unique value of m, with a corresponding list of all events that belong to m.
I tried this:
>>> list(df.groupby('m').event)
[(3, m_id
0 1
6 1
8 2
9 1
Name: event, dtype: int64), (1, m_id
1 1
2 2
3 2
Name: event, dtype: int64), (2, m_id
4 1
5 0
7 2
Name: event, dtype: int64)]
It sort of does what I want in that it groups the events by m. I could massage this back into the dataframe I wanted with some loops, but I feel that I have started down an ugly and unnecessarily complex path. And a slow one, if there are thousands of unique values of m.
Can I perform the conversion I wanted in an elegant manner using Pandas methods?
Bonus if the events column can contain (numpy) arrays so that I can do math directly on the events rows, like df[df.m==1].events + 100, but regular lists are also ok.
In [320]: r = df.groupby('m')['event'].apply(np.array).reset_index(name='event')
In [321]: r
Out[321]:
m event
0 1 [1, 2, 2]
1 2 [1, 0, 2]
2 3 [1, 1, 2, 1]
Bonus:
In [322]: r.loc[r.m==1, 'event'] + 1
Out[322]:
0 [2, 3, 3]
Name: event, dtype: object
You could do:
In [1163]: df.groupby('m')['event'].apply(list).reset_index(name='events')
Out[1163]:
m events
0 1 [1, 2, 2]
1 2 [1, 0, 2]
2 3 [1, 1, 2, 1]
If you don't want sorted m
In [1164]: df.groupby('m', sort=False).event.apply(list).reset_index(name='events')
Out[1164]:
m events
0 3 [1, 1, 2, 1]
1 1 [1, 2, 2]
2 2 [1, 0, 2]
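In recent pandas versions, agg(list) is an equivalent spelling (a minimal sketch, assuming the same df):

# agg(list) collects each group's events into a list; reset_index turns it back into a DataFrame.
r = df.groupby('m')['event'].agg(list).reset_index(name='events')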
