I have a sample snippet that works as expected:
import pandas as pd
df = pd.DataFrame(data={'label': ['a', 'b', 'b', 'c'], 'wave': [1, 2, 3, 4], 'y': [0,0,0,0]})
df['new'] = df.groupby(['label'])[['wave']].transform(tuple)
The result is:
label wave y new
0 a 1 0 (1,)
1 b 2 0 (2, 3)
2 b 3 0 (2, 3)
3 c 4 0 (4,)
It works analogously if, instead of tuple, I pass set, frozenset, or dict to transform, but if I pass list I get a completely unexpected result:
df['new'] = df.groupby(['label'])[['wave']].transform(list)
label wave y new
0 a 1 0 1
1 b 2 0 2
2 b 3 0 3
3 c 4 0 4
There is a workaround to get expected result:
df['new'] = df.groupby(['label'])[['wave']].transform(tuple)['wave'].apply(list)
label wave y new
0 a 1 0 [1]
1 b 2 0 [2, 3]
2 b 3 0 [2, 3]
3 c 4 0 [4]
I thought mutability/immutability (list vs. tuple) might explain it, but set/frozenset behave consistently with tuple, so that can't be the reason.
The question is: why does it work this way?
I've come across a similar issue before. The underlying issue, I think, is that when the number of elements in the list matches the number of records in the group, transform tries to unpack the list so that each element of the list maps to a record in the group.
For example, this will cause the list to unpack, as the length of the list matches the length of each group:
df.groupby(['label'])[['wave']].transform(lambda x: list(x))
wave
0 1
1 2
2 3
3 4
However, if the length of the list is not the same as the group's, you get the desired behaviour:
df.groupby(['label'])[['wave']].transform(lambda x: list(x)+[0])
wave
0 [1, 0]
1 [2, 3, 0]
2 [2, 3, 0]
3 [4, 0]
I think this is a side effect of the list unpacking functionality.
I think that is a bug in pandas. Can you open a ticket on their GitHub page, please?
At first I thought it might be because list is just not handled correctly as an argument to .transform, but if I do:
def create_list(obj):
    print(type(obj))
    return obj.to_list()

df.groupby(['label'])[['wave']].transform(create_list)
I get the same unexpected result. If, however, the agg method is used, it works directly:
df.groupby(['label'])['wave'].agg(list)
Out[179]:
label
a [1]
b [2, 3]
c [4]
Name: wave, dtype: object
I can't imagine that this is intended behavior.
Btw. I also find the different behavior suspicious that shows up if you apply tuple to a grouped Series versus a grouped DataFrame. E.g. if transform is applied to a Series instead of a DataFrame, the result is not a Series containing tuples but a Series containing ints (remember that for [['wave']], which creates a one-column DataFrame, transform(tuple) did return tuples):
df.groupby(['label'])['wave'].transform(tuple)
Out[177]:
0 1
1 2
2 3
3 4
Name: wave, dtype: int64
If I do that again with agg instead of transform, it works for both ['wave'] and [['wave']].
I was using version 0.25.0 on an Ubuntu x86_64 system for my tests.
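For completeness, here is a minimal sketch of the two agg calls (both return a list per group on my 0.25 setup; the output layout may differ on newer versions):

df.groupby('label')['wave'].agg(list)    # Series of lists, indexed by label
df.groupby('label')[['wave']].agg(list)  # one-column DataFrame of lists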
Since DataFrames are mainly designed to handle 2D data, storing arrays instead of scalar values in cells may stumble upon a caveat such as this one.
pd.DataFrame.transform is originally implemented on top of .agg:
# pandas/core/generic.py
@Appender(_shared_docs["transform"] % dict(axis="", **_shared_doc_kwargs))
def transform(self, func, *args, **kwargs):
    result = self.agg(func, *args, **kwargs)
    if is_scalar(result) or len(result) != len(self):
        raise ValueError("transforms cannot produce " "aggregated results")
    return result
However, transform must always return a DataFrame with the same length as self, which is essentially the input.
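A minimal sketch of that length check firing (against the 0.25-era code path quoted above; the exact message differs in newer pandas versions):

import pandas as pd
df2 = pd.DataFrame({'wave': [1, 2, 3, 4]})
try:
    df2.transform(lambda x: x.sum())  # aggregates each column to one value
except ValueError as e:
    print(e)  # "transforms cannot produce aggregated results" on 0.25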
When you use .agg on the grouped data, it works fine:
df.groupby('label')['wave'].agg(list)
label
a [1]
b [2, 3]
c [4]
Name: wave, dtype: object
The problem gets introduced when transform tries to return a Series with the same length.
In the process of transforming a groupby element, which is a slice of self, and then concatenating it again, lists get unpacked to the same length of index, as @Allen mentioned.
However, when the lengths don't align, the lists don't get unpacked:
df.groupby(['label'])[['wave']].transform(lambda x: list(x) + [1])
wave
0 [1, 1]
1 [2, 3, 1]
2 [2, 3, 1]
3 [4, 1]
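To see the alignment effect in isolation, here is a minimal sketch of the array coercion that I assume drives this (not the actual pandas code path):

import numpy as np
# Equal-length inner lists coerce to a 2D array: every scalar lands in its
# own cell, which is the "unpacking" seen above.
print(np.asarray([[2, 3], [2, 3]], dtype=object).shape)     # (2, 2)
# Ragged inner lists cannot form a rectangle, so each row keeps its list
# object -- the "desired" behaviour.
print(np.asarray([[2, 3, 0], [4, 0]], dtype=object).shape)  # (2,)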
A workaround for this problem might be to avoid transform:
df = pd.DataFrame(data={'label': ['a', 'b', 'b', 'c'], 'wave': [1, 2, 3, 4], 'y': [0,0,0,0]})
df = df.merge(df.groupby('label')['wave'].agg(list).rename('new'), on='label')
df
label wave y new
0 a 1 0 [1]
1 b 2 0 [2, 3]
2 b 3 0 [2, 3]
3 c 4 0 [4]
Another interesting workaround, which works for strings, is:
df = df.applymap(str) # Make them all strings... would be best to use on non-numeric data.
df.groupby(['label'])['wave'].transform(' '.join).str.split()
Output:
0 [1]
1 [2, 3]
2 [2, 3]
3 [4]
Name: wave, dtype: object
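One caveat worth noting (my addition, not part of the original workaround): the round trip assumes the string values themselves contain no spaces, otherwise split breaks them apart again:

s = pd.Series(['a b']).str.split()
print(s[0])  # ['a', 'b'] -- one value became two tokens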
The suggested answers do not work on pandas 1.2.4 anymore. Here is a workaround for it:
df.groupby(['label'])[['wave']].transform(lambda x: [list(x) + [1]]*len(x))
The idea behind it is the same as explained in other answers (e.g. @Allen's answer). The solution here is to wrap the result in another list and repeat it as many times as the group length, so that when pandas transform unwraps it, each row gets the inner list.
Output:
wave
0 [1, 1]
1 [2, 3, 1]
2 [2, 3, 1]
3 [4, 1]
I have a large dataframe with millions of rows. I encoded all strings as numeric values in order to use NumPy's vectorization to increase processing speed.
So I was looking for a way to quickly check if a number exists in another column of lists. Previously, I was using a list comprehension with string values, but after converting to np.arrays I was looking for a similar function.
I stumbled across this link: check if values of a column are in values of another numpy array column in pandas
In order to use numpy.isin, I tried running the code below:
dt = pd.DataFrame({'id' : ['a', 'a', 'a', 'b', 'b'],
'col_a': [1,2,5,1,2],
'col_b': [2,2,[2,5,4],4,[1,5,6,3,2]]})
dt
id col_a col_b
0 a 1 2
1 a 2 2
2 a 5 [2, 5, 4]
3 b 1 4
4 b 2 [1, 5, 6, 3, 2]
When I enter:
np.isin(dt['col_a'], dt['col_b'])
The output is:
array([False, True, False, False, True])
This is incorrect, as the 3rd row has 5 in both columns col_a and col_b.
Whereas if I change the value to 4 as below:
dt = pd.DataFrame({'id' : ['a', 'a', 'a', 'b', 'b'],
'col_a': [1,2,4,1,2],
'col_b': [2,2,[2,5,4],4,[1,5,6,3,2]]})
dt
id col_a col_b
0 a 1 2
1 a 2 2
2 a 4 [2, 5, 4]
3 b 1 4
4 b 2 [1, 5, 6, 3, 2]
and execute same code:
np.isin(dt['col_a'], dt['col_b'])
I get correct result:
array([False, True, True, False, True])
Can someone please let me know why it's giving different results?
Since col_b not only has lists but also integers, you may need to use apply and treat them differently:
dt.apply(lambda x: x['col_a'] in x['col_b'] if type(x['col_b']) is list
         else x['col_a'] == x['col_b'], axis=1)
Output:
0 False
1 True
2 True
3 False
4 True
dtype: bool
np.isin checks, for each element of dt['col_a'], whether it is present anywhere in the whole dt['col_b'] column, i.e.:
[
1 in dt['col_b'],
2 in dt['col_b'],
5 in dt['col_b'],
...
]
There's no element equal to 5 in dt['col_b'] (the 5 is inside a list, and an int never equals a list), but there is an element equal to 4.
From the docs
isin is an element-wise function version of the python keyword in. isin(a, b) is roughly equivalent to np.array([item in b for item in a]) if a and b are 1-D sequences.
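A sketch of that equivalence applied to the first dt (where col_a row 2 is 5):

# == against an object column compares whole elements; it never looks
# inside the nested lists:
[a in dt['col_b'].values for a in dt['col_a']]
# 5 == [2, 5, 4] is False (an int never equals a list), hence row 2 is
# False; 4 == 4 is True, which is why changing 5 to 4 "fixes" it.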
Also, your issue is that you have an inconsistent dt['col_b'] column (some values are numbers, some are lists). I think the easiest approach is to use apply:
def isin(row):
    if isinstance(row['col_b'], int):
        return row['col_a'] == row['col_b']
    else:
        return row['col_a'] in row['col_b']

dt.apply(isin, axis=1)
Output:
0 False
1 True
2 True
3 False
4 True
dtype: bool
I have a pandas DataFrame where each cell is a set of numbers. I would like to go through the DataFrame and pass each number, along with its row index, to a function. What's the most pandas-esque and efficient way to do this? Here's an example of one way to do it with for-loops, but I'm hopeful that there's a better approach.
def my_func(a, b):
    pass

d = {"a": [{1}, {4}], "b": [{1, 2, 3}, {2}]}
df = pd.DataFrame(d)

for index, item in df.iterrows():
    for j in item:
        for a in list(j):
            my_func(index, a)
Instead of iterating, we can reshape the values into one column using stack, then explode into separate rows:
s = df.stack().explode()
s:
0 a 1
b 1
b 2
b 3
1 a 4
b 2
dtype: object
We can further droplevel if we don't want the old column labels:
s = df.stack().explode().droplevel(1)
s:
0 1
0 1
0 2
0 3
1 4
1 2
dtype: object
reset_index can be used to create a DataFrame instead of a Series:
new_df = df.stack().explode().droplevel(1).reset_index()
new_df.columns = ['a', 'b'] # Rename columns to whatever
new_df:
a b
0 0 1
1 0 1
2 0 2
3 0 3
4 1 4
5 1 2
If I fully understood your problem, this might be one way of doing it:
[list(item) for sublist in df.values.tolist() for item in sublist]
The output will look like this:
[[1], [1, 2, 3], [4], [2]]
Since this is a nested list, you can flatten it if your requirement is a single list.
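A sketch of that flattening step, assuming the nested list from above:

nested = [list(item) for sublist in df.values.tolist() for item in sublist]
flat = [x for inner in nested for x in inner]
print(flat)  # [1, 1, 2, 3, 4, 2]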
I want to filter my df down to only those rows whose value in column A appears at least as frequently as some threshold. I currently am using a trick with two value_counts(). To explain what I mean:
df = pd.DataFrame([[1, 2, 3], [1, 4, 5], [6, 7, 8]], columns=['A', 'B', 'C'])
'''
A B C
0 1 2 3
1 1 4 5
2 6 7 8
'''
I want to remove any row whose value in column A appears fewer than 2 times in that column. I currently do this:
df = df[df['A'].isin(df.A.value_counts()[df.A.value_counts() >= 2].index)]
Does Pandas have a method to do this which is cleaner than having to call value_counts() twice?
It's probably easiest to filter by group size, where the grouping is done on column A:
df.groupby('A').filter(lambda x: len(x) >= 2)
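Applied to the example df above, this keeps only the rows whose A value occurs at least twice (a sketch of the expected output):

df.groupby('A').filter(lambda x: len(x) >= 2)
#    A  B  C
# 0  1  2  3
# 1  1  4  5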
I'm trying to union several pd.DataFrames along the index axis, using the index to remove duplicates (A and B are from the same source "table", filtered by different predicates, and I'm trying to recombine them).
A = pd.DataFrame({"values": [1, 2]}, pd.MultiIndex.from_tuples([(1,1),(1,2)], names=('l1', 'l2')))
B = pd.DataFrame({"values": [2, 3, 2]}, pd.MultiIndex.from_tuples([(1,2),(2,1),(2,2)], names=('l1', 'l2')))
pd.concat([A,B]).drop_duplicates() fails since it ignores the index and de-dups on the values, so it removes index item (2,2).
pd.concat([A.reset_index(),B.reset_index()]).drop_duplicates(subset=('l1', 'l2')).set_index(['l1', 'l2']) does what I want, but I feel like there should be a better way.
You may do a simple concat and filter out dups using index.duplicated:
df1 = pd.concat([A,B])
df1[~df1.index.duplicated()]
Out[123]:
values
l1 l2
1 1 1
2 2
2 1 3
2 2
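By default index.duplicated marks all but the first occurrence, so A's value wins for the shared key (1, 2); to prefer B's values instead, a sketch using the documented keep parameter:

df1[~df1.index.duplicated(keep='last')]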
I'm trying to loop through the rows of a DataFrame with a function calculating the most frequent element in a series. The function works perfectly when I manually supply a series to it:
# Create DataFrame
df = pd.DataFrame({'a' : [1, 2, 1, 2, 1, 2, 1, 1],
'b' : [1, 1, 2, 1, 1, 1, 2, 2],
'c' : [1, 2, 2, 1, 2, 2, 2, 1]})
# Create function calculating most frequent element
from collections import Counter

def freq_value(series):
    return Counter(series).most_common()[0][0]
# Test function on one row
freq_value(df.iloc[1])
# Another test
freq_value((df.iloc[1, 0], df.iloc[1, 1], df.iloc[1, 2]))
With both tests I get the desired result. However, when I try to apply this function in a loop through the DataFrame rows and save the result into a new column, I get the error "'Series' object is not callable", 'occurred at index 0'. The line producing the error is as follows:
# Loop through rows of the dataframe and write the result into a new column
df['result'] = df.apply(lambda row: freq_value((row('a'), row('b'), row('c'))), axis = 1)
How exactly does row in the apply() function work? Shouldn't it supply my freq_value() function with values from columns 'a', 'b', and 'c'?
@jpp's answer addresses how to apply your custom function, but you can also get the desired result using df.mode with axis=1. This avoids the use of apply and still gives you a column of the most common value for each row.
df['result'] = df.mode(1)
>>> df
a b c result
0 1 1 1 1
1 2 1 2 2
2 1 2 2 2
3 2 1 1 1
4 1 1 2 1
5 2 1 2 2
6 1 2 2 2
7 1 2 1 1
row is not a function within your lambda, so parentheses are not appropriate. Instead, you should use the __getitem__ method or the loc accessor to access values. The syntactic sugar for the former is []:
df['result'] = df.apply(lambda row: freq_value((row['a'], row['b'], row['c'])), axis=1)
Using the loc alternative:
def freq_value_calc(row):
    return freq_value((row.loc['a'], row.loc['b'], row.loc['c']))
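Applied the same way as the lambda version (a sketch):

df['result'] = df.apply(freq_value_calc, axis=1)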
To understand exactly why this is the case, it helps to rewrite your lambda as a named function:
def freq_value_calc(row):
    print(type(row))  # useful for debugging
    return freq_value((row['a'], row['b'], row['c']))

df['result'] = df.apply(freq_value_calc, axis=1)
Running this, you'll find that row is of type <class 'pandas.core.series.Series'>, i.e. a series indexed by column labels if you use axis=1. To access the value in a series for a given label, you can either use __getitem__ / [] syntax or loc.
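A tiny sketch of both access styles on such a row (hypothetical interactive session):

row = df.iloc[0]  # a Series indexed by the column labels 'a', 'b', 'c'
row['a']          # __getitem__ / [] syntax
row.loc['a']      # loc accessor, same value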
df['CommonValue'] = df.apply(lambda x: x.mode()[0], axis=1)