I have a Pandas dataframe that looks something like this:
>>> df
m event
0 3 1
1 1 1
2 1 2
3 1 2
4 2 1
5 2 0
6 3 1
7 2 2
8 3 2
9 3 1
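For reproducibility, the frame above can be built directly:
import pandas as pd

df = pd.DataFrame({'m':     [3, 1, 1, 1, 2, 2, 3, 2, 3, 3],
                   'event': [1, 1, 2, 2, 1, 0, 1, 2, 2, 1]})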
I want to group the values of the event column into lists based on the m column so that I would get this:
>>> df
m events
0 3 [1, 1, 2, 1]
1 1 [1, 2, 2]
2 2 [1, 0, 2]
There should be one row per unique value of m with a corresponding list of all events that belong to m.
I tried this:
>>> list(df.groupby('m').event)
[(3, m_id
0 1
6 1
8 2
9 1
Name: event, dtype: int64), (1, m_id
1 1
2 2
3 2
Name: event, dtype: int64), (2, m_id
4 1
5 0
7 2
Name: event, dtype: int64)]
It sort of does what I want in that it groups the events by m. I could massage this back into the dataframe I want with some loops, but I feel I have started down an ugly and unnecessarily complex path, and a slow one if there are thousands of unique values for m.
Can I perform the conversion I wanted in an elegant manner using Pandas methods?
Bonus if the events column can contain (numpy) arrays so that I can do math directly on the events rows, like df[df.m==1].events + 100, but regular lists are also ok.
In [320]: r = df.groupby('m')['event'].apply(np.array).reset_index(name='event')
In [321]: r
Out[321]:
m event
0 1 [1, 2, 2]
1 2 [1, 0, 2]
2 3 [1, 1, 2, 1]
Bonus:
In [322]: r.loc[r.m==1, 'event'] + 1
Out[322]:
0 [2, 3, 3]
Name: event, dtype: object
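The np.array step is what makes the bonus arithmetic work: numpy arrays add element-wise, while plain Python lists concatenate. A minimal illustration:
import numpy as np

np.array([1, 2, 2]) + 1    # element-wise addition: array([2, 3, 3])
[1, 2, 2] + [1]            # list concatenation: [1, 2, 2, 1]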
You could do:
In [1163]: df.groupby('m')['event'].apply(list).reset_index(name='events')
Out[1163]:
m events
0 1 [1, 2, 2]
1 2 [1, 0, 2]
2 3 [1, 1, 2, 1]
If you don't want m sorted:
In [1164]: df.groupby('m', sort=False).event.apply(list).reset_index(name='events')
Out[1164]:
m events
0 3 [1, 1, 2, 1]
1 1 [1, 2, 2]
2 2 [1, 0, 2]
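On reasonably recent pandas versions, agg(list) should behave the same and reads just as well (a sketch of the equivalent call):
df.groupby('m', sort=False)['event'].agg(list).reset_index(name='events')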
So let's say I have a list of lists:
data = [[1, 2, 3, 4], [0, 2, 3, 4], [0, 0, 3, 4], [0, 0, 0, 4]]
When I try to turn this into a dataframe, it comes out as follows:
df = pd.DataFrame(data)
Current (incorrect) output:
list 1  list 2  list 3  list 4
1       2       3       4
2       3       4       0
3       4       0       0
4       0       0       0
This is the output I am hoping for:
list 1  list 2  list 3  list 4
1       0       0       0
2       2       0       0
3       3       3       0
4       4       4       4
Does anyone have any suggestion on how to fix this?
Each sublist in data is supposed to become a column in df, not a row, so you have to transpose data and then turn it into a df, like this:
df = pd.DataFrame(np.array(data).T)
Alternatively, if data contains strings, you could do:
df = pd.DataFrame(list(map(list, zip(*data))))
which transposes the list of lists and then turns it into a df.
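Putting it together, a minimal runnable version using the sample data (the "list N" column labels are taken from the expected output above):
import numpy as np
import pandas as pd

data = [[1, 2, 3, 4], [0, 2, 3, 4], [0, 0, 3, 4], [0, 0, 0, 4]]

# Transpose so each sublist becomes a column, then label the columns.
df = pd.DataFrame(np.array(data).T,
                  columns=['list 1', 'list 2', 'list 3', 'list 4'])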
IIUC, you want to "push" the 0s to the end.
You can use sorted with a custom key.
The exact output you expect is ambiguous because the data is symmetric, but you can transpose if needed:
data = [[1, 2, 3, 4], [0, 2, 3, 4], [0, 0, 3, 4], [0, 0, 0, 4]]
df = (pd.DataFrame([sorted(l, key=lambda x: x == 0) for l in data])
        .T  # remove the ".T" if you don't want to transpose
      )
NB. If you want to push the 0s to the end and sort the other values, use lambda x: (x==0, x) as key
Output:
0 1 2 3
0 1 2 3 4
1 2 3 4 0
2 3 4 0 0
3 4 0 0 0
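For instance, the alternative key from the note sorts the non-zero values while still pushing the 0s to the end:
sorted([4, 0, 3, 0, 1], key=lambda x: (x == 0, x))
# [1, 3, 4, 0, 0]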
I have a large dataframe for which I want to count, by row, the number of occurrences of a series of specific values (given by an external function). For reproducibility, let's assume the following simplified dataframe:
data = {'A': [3, 2, 1, 0], 'B': [4, 3, 2, 1], 'C': [1, 2, 3, 4], 'D': [1, 1, 2, 2], 'E': [4, 4, 4, 4]}
df = pd.DataFrame.from_dict(data)
df
   A  B  C  D  E
0  3  4  1  1  4
1  2  3  2  1  4
2  1  2  3  2  4
3  0  1  4  2  4
How can I count the number of occurrences of specific values (given by a series with the same size) by row?
Again for simplicity, let's assume this values_series is given by the max of each row.
values_series = df.max(axis=1)
0    4
1    4
2    4
3    4
dtype: int64
The solution I came up with doesn't seem very pythonic (e.g. it uses iterrows(), which is slow):
max_count = []
for index, row in df.iterrows():
    max_count.append(row.value_counts()[values_series.loc[index]])
df_counts = pd.Series(max_count)
Is there any more efficient way to do this?
We can compare the transposed df.T directly to the df.max series, thanks to broadcasting:
(df.T == df.max(axis=1)).sum()
# result
0 2
1 1
2 1
3 2
dtype: int64
(Transposing also has the added benefit that we can use sum without specifying the axis, i.e. with the default axis=0.)
You can try
df.eq(df.max(1), axis=0).sum(1)
Out[361]:
0 2
1 1
2 1
3 2
dtype: int64
The perfect job for numpy broadcasting:
a = df.to_numpy()
b = values_series.to_numpy()[:, None]
(a == b).sum(axis=1)
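If the result is needed back as a Series aligned with df's index, a small follow-up (a sketch):
df_counts = pd.Series((a == b).sum(axis=1), index=df.index)
# 0    2
# 1    1
# 2    1
# 3    2
# dtype: int64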
How do I take only the values on the right? E.g. just get an array/list of [120, 108, 82...]
d = daily_counts.loc[daily_counts['workingday'] == "yes", 'casual']
d
You can simply use the tolist(), values, or to_numpy() methods. Here is a toy example:
>>> df = pd.DataFrame({'a':[1,2,3,4,5,7,1,4]})
>>> df
a
0 1
1 2
2 3
3 4
4 5
5 7
6 1
7 4
>>> df['a'].value_counts()  # generates output similar to yours
4 2
1 2
7 1
5 1
3 1
2 1
Name: a, dtype: int64
>>> df['a'].value_counts().tolist() #extracting as a list
[2, 2, 1, 1, 1, 1]
>>> df['a'].value_counts().values #extracting as a numpy array
array([2, 2, 1, 1, 1, 1])
>>> df['a'].value_counts().to_numpy() #extracting as a numpy array
array([2, 2, 1, 1, 1, 1])
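Applied to the selection from the question, either of these should give the bare values (d as defined there):
d.tolist()      # plain Python list
d.to_numpy()    # numpy array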
I'm trying to create a column containing a cumulative sum of the number of entries, tid, grouped according to unique values of (rid, tid). The cumulative sum should increment by the number of entries in the grouping, as shown in the df3 dataframe below, rather than one at a time.
import pandas as pd
df1 = pd.DataFrame({
    'rid': [1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 5],
    'tid': [1, 2, 2, 1, 1, 3, 1, 4, 5, 1, 1, 1, 3]})
rid tid
0 1 1
1 1 2
2 1 2
3 2 1
4 2 1
5 2 3
6 3 1
7 3 4
8 4 5
9 5 1
10 5 1
11 5 1
12 5 3
After the required operation this should give:
df3 = pd.DataFrame({
    'rid': [1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 5, 5],
    'tid': [1, 2, 2, 1, 1, 3, 1, 4, 5, 1, 1, 1, 3],
    'groupentries': [1, 2, 2, 2, 2, 1, 1, 1, 1, 3, 3, 3, 1],
    'cumulativeentries': [1, 2, 2, 3, 3, 1, 4, 1, 1, 7, 7, 7, 2]})
rid tid groupentries cumulativeentries
0 1 1 1 1
1 1 2 2 2
2 1 2 2 2
3 2 1 2 3
4 2 1 2 3
5 2 3 1 1
6 3 1 1 4
7 3 4 1 1
8 4 5 1 1
9 5 1 3 7
10 5 1 3 7
11 5 1 3 7
12 5 3 1 2
The derived column I'm after is cumulativeentries, although I've only figured out how to generate the intermediate groupentries column using pandas:
df1.groupby(["rid", "tid"]).size()
Values in cumulativeentries are actually a kind of running count.
The task is to count occurrences of the current tid in the "source area" of the tid column:
from the beginning of the DataFrame,
up to (and including) the end of the current group.
To compute both required values for each group, I defined the following function:
def fn(grp):
    lastRow = grp.iloc[-1]    # last row of the current group
    lastId = lastRow.name     # index of this row
    tids = df1.truncate(after=lastId).tid
    return [grp.index.size, tids[tids == lastRow.tid].size]
To get the "source area" mentioned above I used the truncate function. In my opinion this is a very intuitive solution, based on the notion of the "source area".
The function returns a list containing both required values:
the size of the current group,
how many tids equal to the current tid are in the
truncated tid column.
To apply this function, run:
df2 = df1.groupby(['rid', 'tid']).apply(fn).apply(pd.Series)\
    .rename(columns={0: 'groupentries', 1: 'cumulativeentries'})
Details:
apply(fn) generates a Series containing 2-element lists.
apply(pd.Series) converts it to a DataFrame (with default column names).
rename sets the target column names.
And the last thing to do is to join this table to df1:
df1.join(df2, on=['rid', 'tid'])
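The join works because df2 carries a (rid, tid) MultiIndex that the on=['rid', 'tid'] columns of df1 are matched against; the first rows of the result agree with the expected df3:
result = df1.join(df2, on=['rid', 'tid'])
#    rid  tid  groupentries  cumulativeentries
# 0    1    1             1                  1
# 1    1    2             2                  2
# 2    1    2             2                  2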
For the first column, use GroupBy.transform with DataFrameGroupBy.size. For the second, use a custom function that takes all values of the tid column up to the group's last index, compares them with the group's last value, and counts the matches with sum:
f = lambda x: (df1['tid'].iloc[:x.index[-1]+1] == x.iat[-1]).sum()
df1['groupentries'] = df1.groupby(["rid", "tid"])['rid'].transform('size')
df1['cumulativeentries'] = df1.groupby(["rid", "tid"])['tid'].transform(f)
print (df1)
rid tid groupentries cumulativeentries
0 1 1 1 1
1 1 2 2 2
2 1 2 2 2
3 2 1 2 3
4 2 1 2 3
5 2 3 1 1
6 3 1 1 4
7 3 4 1 1
8 4 5 1 1
9 5 1 3 7
10 5 1 3 7
11 5 1 3 7
12 5 3 1 2
I tried to rework my question to match the quality criteria and spent more time trying to achieve the result on my own.
Given are two DataFrames
import pandas as pd

a = pd.DataFrame({"id": ["id1"] * 3 + ["id2"] * 3 + ["id3"] * 3,
                  "left": [6, 2, 5, 2, 1, 4, 5, 2, 4],
                  "right": [1, 3, 4, 6, 5, 3, 6, 3, 2]})
b = pd.DataFrame({"id": ["id1"] * 6 + ["id2"] * 6 + ["id3"] * 6,
                  "left_and_right": list(range(1, 7)) * 3,
                  "boolen": [0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1]})
The expected result is
result = pd.DataFrame({"id": ["id1"] * 3 + ["id2"] * 3 + ["id3"] * 3,
                       "left": [6, 2, 5, 2, 1, 4, 5, 2, 4],
                       "right": [1, 3, 4, 6, 5, 3, 6, 3, 2],
                       "NEW": [0, 1, 1, 0, 1, 1, 1, 1, 0]})
So I want to check, for each row of DataFrame b, whether there is a row in DataFrame a where a.id == b.id, and then look up whether b.left_and_right equals a.left or a.right. If such a row is found and b.boolen is True/1 for that value, a.NEW in that row should also be True/1.
I hope the example illustrates it better than my words.
To sum it up: wherever the id is the same in both DataFrames, if b.boolen is True/1 for a value of b.left_and_right that appears in a.left or a.right, the new value in a.NEW should also be True/1.
I have tried using the pd.match() and pd.merge() functions in combination with the & and | operators, but could not achieve the wanted result.
Some time ago I asked a very similar question about a similar problem in R (the data was organized in a slightly different way, so it was a bit different), but now I am failing with the same approach in Python.
Related question: Conditional matching of two lists with multi-column data.frames
Just use boolean masks with & (and) and | (or):
In [11]: (a.A == b.A) & ((a.B == b.E) | (a.C == b.E)) # they all satisfy this requirement!
Out[11]:
0 True
1 True
2 True
3 True
dtype: bool
In [12]: b.D[(a.A == b.A) & ((a.B == b.E) | (a.C == b.E))]
Out[12]:
0 0
1 1
2 0
3 0
Name: D, dtype: int64
In [13]: a['NEW'] = b.D[(a.A == b.A) & ((a.B == b.E) | (a.C == b.E))]
In [14]: a
Out[14]:
A B C NEW
0 foo 1 4 0
1 goo 2 3 1
2 doo 3 1 0
3 boo 4 2 0
Update with the slightly different question:
In [21]: merged = pd.merge(a, b, on='id')
In [22]: matching = merged[(merged.left == merged.left_and_right) | (merged.right == merged.left_and_right)]
In [23]: (matching.groupby(['id', 'left', 'right'])['boolen'].sum()).reset_index()
Out[23]:
id left right boolen
0 id1 2 3 1
1 id1 5 4 1
2 id1 6 1 0
3 id2 1 5 2
4 id2 2 6 0
5 id2 4 3 1
6 id3 2 3 1
7 id3 4 2 0
8 id3 5 6 1
Note there is a 2 here, so you probably want to treat anything > 0 as True:
In [24]: (matching.groupby(['id', 'left', 'right'])['boolen'].sum() > 0).reset_index()
Out[24]:
id left right boolen
0 id1 2 3 True
1 id1 5 4 True
2 id1 6 1 False
3 id2 1 5 True
4 id2 2 6 False
5 id2 4 3 True
6 id3 2 3 True
7 id3 4 2 False
8 id3 5 6 True
You may want to rename the boolen column to NEW.
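For example, a final touch-up along those lines (a sketch, reusing matching from above):
result = (matching.groupby(['id', 'left', 'right'])['boolen'].sum() > 0).reset_index()
result = result.rename(columns={'boolen': 'NEW'})
result['NEW'] = result['NEW'].astype(int)   # 0/1 rather than False/True, as in the expected result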