Loop through subsets of rows in a DataFrame - python

I am trying to loop through the rows of a DataFrame with a function that calculates the most frequent element in a series. The function works perfectly when I manually supply a series to it:
import pandas as pd

# Create DataFrame
df = pd.DataFrame({'a': [1, 2, 1, 2, 1, 2, 1, 1],
                   'b': [1, 1, 2, 1, 1, 1, 2, 2],
                   'c': [1, 2, 2, 1, 2, 2, 2, 1]})
# Create function calculating most frequent element
from collections import Counter

def freq_value(series):
    return Counter(series).most_common()[0][0]
# Test function on one row
freq_value(df.iloc[1])
# Another test
freq_value((df.iloc[1, 0], df.iloc[1, 1], df.iloc[1, 2]))
With both tests I get the desired result. However, when I try to apply this function across the DataFrame rows and save the result into a new column, I get the error "'Series' object is not callable", 'occurred at index 0'. The line producing the error is as follows:
# Loop through rows of a dataframe and write the result into a new column
df['result'] = df.apply(lambda row: freq_value((row('a'), row('b'), row('c'))), axis = 1)
How exactly does row() work inside apply()? Shouldn't it supply the values from columns 'a', 'b' and 'c' to my freq_value() function?

#jpp's answer addresses how to apply your custom function, but you can also get the desired result using df.mode with axis=1. This avoids apply entirely and still gives you a column of the most common value for each row.
df['result'] = df.mode(1)
>>> df
a b c result
0 1 1 1 1
1 2 1 2 2
2 1 2 2 2
3 2 1 1 1
4 1 1 2 1
5 2 1 2 2
6 1 2 2 2
7 1 2 1 1
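A slightly more explicit variant (my addition, not part of the original answer) selects the first mode column, which also covers rows where mode(axis=1) returns more than one column because of ties:
df['result'] = df.mode(axis=1)[0]  # column 0 holds the first modal value of each row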

row is not a function within your lambda, so calling it with parentheses is not appropriate. Instead, you should use the __getitem__ method or the loc accessor to access values. The syntactic sugar for the former is []:
df['result'] = df.apply(lambda row: freq_value((row['a'], row['b'], row['c'])), axis=1)
Using the loc alternative:
def freq_value_calc(row):
    return freq_value((row.loc['a'], row.loc['b'], row.loc['c']))
To understand exactly why this is the case, it helps to rewrite your lambda as a named function:
def freq_value_calc(row):
    print(type(row))  # useful for debugging
    return freq_value((row['a'], row['b'], row['c']))

df['result'] = df.apply(freq_value_calc, axis=1)
Running this, you'll find that row is of type <class 'pandas.core.series.Series'>, i.e. a series indexed by column labels if you use axis=1. To access the value in a series for a given label, you can either use __getitem__ / [] syntax or loc.
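For illustration, a minimal sketch (my addition, reusing df from the question) showing both access styles on a single row:
row = df.iloc[1]           # a Series indexed by the column labels 'a', 'b', 'c'
print(row['a'], row['b'])  # __getitem__ / [] syntax
print(row.loc['c'])        # equivalent label-based access via .loc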

# x is each row as a Series when axis=1; .mode()[0] takes its first modal value
df['CommonValue'] = df.apply(lambda x: x.mode()[0], axis=1)

Related

Can a pd.Series be assigned to a column in an out-of-order pd.DataFrame without mapping to index (i.e. without reordering the values)?

I discovered some unexpected behavior when creating or assigning a new column in Pandas. When I filter or sort the pd.DataFrame (thus mixing up the indexes) and then create a new column from a pd.Series, Pandas reorders the series to map to the DataFrame index. For example:
df = pd.DataFrame({'a': ['alpha', 'beta', 'gamma']},
                  index=[2, 0, 1])
df['b'] = pd.Series(['alpha', 'beta', 'gamma'])
index  a      b
2      alpha  gamma
0      beta   alpha
1      gamma  beta
I think this is happening because the pd.Series has an index [0, 1, 2] which is getting mapped to the pd.DataFrame index. But I wanted to create the new column with values in the correct "order" ignoring index:
index  a      b
2      alpha  alpha
0      beta   beta
1      gamma  gamma
Here's a convoluted example showing how unexpected this behavior is:
df = pd.DataFrame({'num': [1, 2, 3]}, index=[2, 0, 1]) \
       .assign(num_times_two=lambda x: pd.Series(list(x['num']*2)))
index  num  num_times_two
2      1    6
0      2    2
1      3    4
If I use any function that strips the index off the original pd.Series and then returns a new pd.Series, the values get out of order.
Is this a bug in Pandas or intentional behavior? Is there any way to force Pandas to ignore the index when I create a new column from a pd.Series?
If you want to avoid dtype conversions between pandas and numpy (for example, with datetimes), you can set the index of the Series to the index of the DataFrame before assigning it to a column:
either with .set_axis()
The original Series will have its index preserved - by default this operation is not in place:
ser = pd.Series(['alpha', 'beta', 'gamma'])
df['b'] = ser.set_axis(df.index)
or you can change the index of the original Series:
ser.index = df.index # ser.set_axis(df.index, inplace=True) # alternative
df['b'] = ser
OR:
Use a numpy array instead of a Series. It doesn't have indices, so there is nothing to be aligned by.
Any Series can be converted to a numpy array with .to_numpy():
df['b'] = ser.to_numpy()
Any other array-like also can be used, for example, a list.
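For instance, a minimal sketch (assuming the same ser as above) converting the Series to a plain list before assigning:
df['b'] = ser.tolist()  # a list has no index, so the values are taken positionally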
I don't know if it is intentional, but new column assignment is based on the index. Do you need to maintain the old indexes?
If the answer is no, you can simply reset the index before adding a new column:
df = df.reset_index(drop=True)
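A minimal sketch of that approach (my addition), using the DataFrame from the question; note that reset_index returns a new DataFrame unless it is reassigned or called with inplace=True:
df = pd.DataFrame({'a': ['alpha', 'beta', 'gamma']}, index=[2, 0, 1])
df = df.reset_index(drop=True)                   # positions and labels now coincide
df['b'] = pd.Series(['alpha', 'beta', 'gamma'])  # aligns one-to-one with the new 0..2 index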
In your example, I don't see any reason to wrap the values in a new Series (even if something strips the index, like converting to a list):
df = pd.DataFrame({'num': [1, 2, 3]}, index=[2, 0, 1]) \
       .assign(num_times_two=lambda x: list(x['num']*2))
print(df)
Output:
num num_times_two
2 1 2
0 2 4
1 3 6

Pandas transform inconsistent behavior for list

I have a sample snippet that works as expected:
import pandas as pd
df = pd.DataFrame(data={'label': ['a', 'b', 'b', 'c'], 'wave': [1, 2, 3, 4], 'y': [0,0,0,0]})
df['new'] = df.groupby(['label'])[['wave']].transform(tuple)
The result is:
label wave y new
0 a 1 0 (1,)
1 b 2 0 (2, 3)
2 b 3 0 (2, 3)
3 c 4 0 (4,)
It works analogously if, instead of tuple, I pass set, frozenset, or dict to transform, but if I pass list I get a completely unexpected result:
df['new'] = df.groupby(['label'])[['wave']].transform(list)
label wave y new
0 a 1 0 1
1 b 2 0 2
2 b 3 0 3
3 c 4 0 4
There is a workaround to get expected result:
df['new'] = df.groupby(['label'])[['wave']].transform(tuple)['wave'].apply(list)
label wave y new
0 a 1 0 [1]
1 b 2 0 [2, 3]
2 b 3 0 [2, 3]
3 c 4 0 [4]
I thought about mutability/immutability (list/tuple) but for set/frozenset it is consistent.
The question is why it works in this way?
I've come across a similar issue before. The underlying issue, I think, is that when the number of elements in the list matches the number of records in the group, transform tries to unpack the list so that each element of the list maps to a record in the group.
For example, this will cause the list to unpack, as the len of the list matches the length of each group:
df.groupby(['label'])[['wave']].transform(lambda x: list(x))
wave
0 1
1 2
2 3
3 4
However, if the length of the list is not the same as that of each group, you will get the desired behaviour:
df.groupby(['label'])[['wave']].transform(lambda x: list(x)+[0])
wave
0 [1, 0]
1 [2, 3, 0]
2 [2, 3, 0]
3 [4, 0]
I think this is a side effect of the list unpacking functionality.
I think that is a bug in pandas. Can you open a ticket on their github page please?
At first I thought it might be because list is just not handled correctly as an argument to .transform, but if I do:
def create_list(obj):
    print(type(obj))
    return obj.to_list()

df.groupby(['label'])[['wave']].transform(create_list)
I get the same unexpected result. If however the agg method is used, it works directly:
df.groupby(['label'])['wave'].agg(list)
Out[179]:
label
a [1]
b [2, 3]
c [4]
Name: wave, dtype: object
I can't imagine that this is intended behavior.
By the way, I also find the difference in behavior suspicious when tuple is applied to a grouped Series versus a grouped DataFrame. E.g. if transform is applied to a Series instead of a DataFrame, the result is not a Series containing tuples but a Series containing ints (remember that for [['wave']], which creates a one-column DataFrame, transform(tuple) did indeed return tuples):
df.groupby(['label'])['wave'].transform(tuple)
Out[177]:
0 1
1 2
2 3
3 4
Name: wave, dtype: int64
If I do that again with agg instead of transform, it works for both ['wave'] and [['wave']].
I was using version 0.25.0 on an Ubuntu x86_64 system for my tests.
Since DataFrames are mainly designed to handle 2D data, storing array-like objects instead of scalar values can run into caveats such as this one.
pd.DataFrame.transform is originally implemented on top of .agg:
# pandas/core/generic.py
@Appender(_shared_docs["transform"] % dict(axis="", **_shared_doc_kwargs))
def transform(self, func, *args, **kwargs):
    result = self.agg(func, *args, **kwargs)
    if is_scalar(result) or len(result) != len(self):
        raise ValueError("transforms cannot produce aggregated results")
    return result
However, transform always returns a DataFrame that must have the same length as self, which is essentially the input.
When you use an .agg function on the grouped DataFrame, it works fine:
df.groupby('label')['wave'].agg(list)
label
a [1]
b [2, 3]
c [4]
Name: wave, dtype: object
The problem gets introduced when transform tries to return a Series with the same length.
In the process of transforming each groupby element (which is a slice of self) and then concatenating the results back together, lists get unpacked to match the length of the index, as #Allen mentioned.
However, when the lengths don't align, they don't get unpacked:
df.groupby(['label'])[['wave']].transform(lambda x: list(x) + [1])
wave
0 [1, 1]
1 [2, 3, 1]
2 [2, 3, 1]
3 [4, 1]
A workaround for this problem might be to avoid transform:
df = pd.DataFrame(data={'label': ['a', 'b', 'b', 'c'], 'wave': [1, 2, 3, 4], 'y': [0,0,0,0]})
df = df.merge(df.groupby('label')['wave'].agg(list).rename('new'), on='label')
df
label wave y new
0 a 1 0 [1]
1 b 2 0 [2, 3]
2 b 3 0 [2, 3]
3 c 4 0 [4]
Another interesting workaround, which works for strings, is:
df = df.applymap(str) # Make them all strings... would be best to use on non-numeric data.
df.groupby(['label'])['wave'].transform(' '.join).str.split()
Output:
0 [1]
1 [2, 3]
2 [2, 3]
3 [4]
Name: wave, dtype: object
The suggested answers do not work on pandas 1.2.4 anymore. Here is a workaround:
df.groupby(['label'])[['wave']].transform(lambda x: [list(x) + [1]]*len(x))
The idea behind it is the same as explained in the other answers (e.g. #Allen's answer): wrap the list in another list and repeat it as many times as the group length, so that when pandas transform unwraps it, each row gets the inner list.
output:
wave
0 [1, 1]
1 [2, 3, 1]
2 [2, 3, 1]
3 [4, 1]

Pandas: using iloc to retrieve data does not match input index

I have a dataset which contains contributor's id and contributor_message. I wanted to retrieve all samples with the same message, say, contributor_message == 'I support this proposal because...'.
I use data.loc[data.contributor_message == 'I support this proposal because...'].index, which gives the DataFrame indices of all rows with the same message, say 1, 2, 50, 9350, 30678, ...
Then I tried data.iloc[[1,2,50]] and this gives me correct answer, i.e. the indices matches with the DataFrame indices.
However, when I use data.iloc[9350] or higher indices, I will NOT get the corresponding DataFrame index. Say I got 15047 in the DataFrame this time.
Can anyone advise how to fix this problem?
This occurs when your indices are not aligned with their integer location.
Note that pd.DataFrame.loc is used to slice by index and pd.DataFrame.iloc is used to slice by integer location.
Below is a minimal example.
df = pd.DataFrame({'A': [1, 2, 1, 1, 5]}, index=[0, 1, 2, 4, 5])
idx = df[df['A'] == 1].index
print(idx) # Int64Index([0, 2, 4], dtype='int64')
res1 = df.loc[idx]
res2 = df.iloc[idx]
print(res1)
# A
# 0 1
# 2 1
# 4 1
print(res2)
# A
# 0 1
# 2 1
# 5 5
You have 2 options to resolve this problem.
Option 1
Use pd.DataFrame.loc to slice by index, as above.
Option 2
Reset index and use pd.DataFrame.iloc:
df = df.reset_index(drop=True)
idx = df[df['A'] == 1].index
res2 = df.iloc[idx]
print(res2)
# A
# 0 1
# 2 1
# 3 1
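As a further variation not covered in the answer above (my addition, relying on pandas' Index.get_indexer), you can translate the index labels into integer positions and keep using iloc:
df = pd.DataFrame({'A': [1, 2, 1, 1, 5]}, index=[0, 1, 2, 4, 5])
idx = df[df['A'] == 1].index        # Int64Index([0, 2, 4])
pos = df.index.get_indexer(idx)     # array([0, 2, 3]): integer positions of those labels
res3 = df.iloc[pos]                 # same rows as df.loc[idx]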

Min of Str Column in Pandas

I have a dataframe where one column contains a list of values, e.g.
dict = {'a' : [0, 1, 2], 'b' : [4, 5, 6]}
df = pd.DataFrame(dict)
df.loc[:, 'c'] = -1
df['c'] = df.apply(lambda x: [x.a, x.b], axis=1)
So I get:
a b c
0 0 4 [0, 4]
1 1 5 [1, 5]
2 2 6 [2, 6]
I now would like to save the minimum value of each entry of column c in a new column d, which should give me the following data frame:
a b c d
0 0 4 [0, 4] 0
1 1 5 [1, 5] 1
2 2 6 [2, 6] 2
Somehow though I always fail to do it with min() or similar. Right now I am using df.apply(lambda x: min(x['c']), axis=1), but that is too slow in my case. Do you know of a faster way of doing it?
Thanks!
You can get help from numpy:
import numpy as np
df['d'] = np.array(df['c'].tolist()).min(axis=1)
As stated in the comments, if you don't need the column c then:
df['d'] = df[['a','b']].min(axis=1)
Remember that a Series (like df['c']) is iterable. You can then build a new list and assign it to a column, just like you would set a key in a dictionary; the list will automatically be cast to a pd.Series. There is no need to use fancy pandas functions unless you are dealing with really (really) big data.
df['d'] = [min(c) for c in df['c']]
Edit: update in response to the comments below
df['d'] = [min(c, key=lambda v: v - df.a) for c in df['c']]
This doesn't work because v is a single value (on the first iteration it is passed 0, then 4, for example), while df.a is a Series, so v - df.a is a new Series with the elements [v - df.a[0], v - df.a[1], ...]. min then tries to compare these Series as keys, which doesn't make sense: the comparison yields something like [True, False, ...], and pandas raises an error because the truth value of such a Series is ambiguous. What you need is:
df['d'] = [min(c, key=lambda v: v - df['a'][i]) for i, c in enumerate(df['c'])]
# I prefer to use df['a'] rather than df.a
so you take each value of df['a'] in turn from v, not the entire series df['a'].
However, subtracting a constant when calculating the minimum changes nothing, though I'm guessing this is simplified from what you are actually doing. The two samples above do exactly the same thing.
This is a functional solution.
df['d'] = list(map(min, df['c']))
It works because:
df['c'] is a pd.Series, which is an iterable object.
map is a lazy operator which applies a function to each element of an iterable.
Since map is lazy, we must apply list to materialize the values before assigning them to a new column.

Creating Pivot DataFrame using Multiple Columns in Pandas

I have a pandas dataframe following the form in the example below:
data = {'id': [1,1,1,1,2,2,2,2,3,3,3], 'a': [-1,1,1,0,0,0,-1,1,-1,0,0], 'b': [1,0,0,-1,0,1,1,-1,-1,1,0]}
df = pd.DataFrame(data)
Now, what I want to do is create a pivot table such that for each of the columns except the id, I will have 3 new columns corresponding to the values. That is, for column a, I will create a_neg, a_zero and a_pos. Similarly, for b, I will create b_neg, b_zero and b_pos. The values for these new columns would correspond to the number of times those values appear in the original a and b column. The final dataframe should look like this:
result = {'id': [1, 2, 3], 'a_neg': [1, 1, 1],
          'a_zero': [1, 2, 2], 'a_pos': [2, 1, 0],
          'b_neg': [1, 1, 1], 'b_zero': [2, 1, 1], 'b_pos': [1, 2, 1]}
df_result = pd.DataFrame(result)
Now, to do this, I can do the following steps and arrive at my final answer:
by_a = df.groupby(['id', 'a']).count().reset_index().pivot('id', 'a', 'b').fillna(0).astype(int)
by_a.columns = ['a_neg', 'a_zero', 'a_pos']
by_b = df.groupby(['id', 'b']).count().reset_index().pivot('id', 'b', 'a').fillna(0).astype(int)
by_b.columns = ['b_neg', 'b_zero', 'b_pos']
df_result = by_a.join(by_b).reset_index()
However, I believe that that method is not optimal especially if I have a lot of original columns aside from a and b. Is there a shorter and/or more efficient solution for getting what I want to achieve here? Thanks.
A shorter solution, though still quite inefficient:
In [11]: df1 = df.set_index("id")
In [12]: g = df1.groupby(level=0)
In [13]: g.apply(lambda x: x.apply(lambda x: x.value_counts())).fillna(0).astype(int).unstack(1)
Out[13]:
a b
-1 0 1 -1 0 1
id
1 1 1 2 1 2 1
2 1 2 1 1 1 2
3 1 2 0 1 1 1
Note: I think you should be aiming for the multi-index columns.
I'm reasonably sure I've seen a trick to replace the apply/value_counts/fillna with something cleaner and more efficient, but at the moment it eludes me...
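One candidate for such a trick (my suggestion, not verified against the answer above) is pd.crosstab, which counts value occurrences per id directly and avoids the nested apply; this reuses df and pd from the question:
pieces = []
for col in ['a', 'b']:
    # counts of -1/0/1 per id for this column; reindex guards against values that never appear
    ct = pd.crosstab(df['id'], df[col]).reindex(columns=[-1, 0, 1], fill_value=0)
    ct.columns = ['{}_neg'.format(col), '{}_zero'.format(col), '{}_pos'.format(col)]
    pieces.append(ct)
df_result = pd.concat(pieces, axis=1).reset_index()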
