Idiomatic way to create pandas dataframe as concatenation of function of another's rows - python

Say I have one dataframe
import pandas as pd
input_df = pd.DataFrame(dict(a=[1, 2], b=[2, 3]))
I also have a function f that maps each row to another dataframe. Here's an example of such a function. Note that in general the function could take any form, so I'm not looking for answers that use agg to reimplement the f below.
def f(row):
    return pd.DataFrame(dict(x=[row['a'] * row['b'], row['a'] + row['b']],
                             y=[row['a']**2, row['b']**2]))
I want to create one dataframe that is the concatenation of the function applied to each of the first dataframe's rows. What is the idiomatic way to do this?
output_df = pd.concat([f(row) for _, row in input_df.iterrows()])
I thought I should be able to use apply or similar for this purpose, but nothing seemed to work. The desired output is:
   x  y
0  2  1
1  3  4
0  6  4
1  5  9
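For what it's worth, apply can be pressed into service here as well; a minimal sketch, assuming f always returns a DataFrame:
# a minimal sketch, assuming f always returns a DataFrame:
# apply(axis=1) collects the per-row DataFrames into an object Series,
# and pd.concat stitches that list back together
output_df = pd.concat(input_df.apply(f, axis=1).tolist())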

You can use DataFrame.agg to calculate the products and sums, numpy.ndarray.reshape to flatten them, and df.pow(2) / np.square to calculate the squares.
import numpy as np

out = pd.DataFrame({'x': input_df.agg(['prod', 'sum'], axis=1).to_numpy().reshape(-1),
                    'y': np.square(input_df).to_numpy().reshape(-1)})
out
   x  y
0  2  1
1  3  4
2  6  4
3  5  9

You should avoid iterating over rows (see How to iterate over rows in a DataFrame in Pandas).
Instead try:
df = input_df.copy()
df = df.assign(product=df.a * df.b, sum=df.sum(axis=1),
               asq=df.a**2, bsq=df.b**2)
Then:
rows = [[[p, s], [asq, bsq]] for p, s, asq, bsq in df[['product', 'sum', 'asq', 'bsq']].to_numpy()]
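If the goal is the asker's two-column x/y layout, a minimal follow-up sketch (my addition, reusing the columns assigned above):
# interleave the derived columns into the desired x/y layout
out = pd.DataFrame({'x': df[['product', 'sum']].to_numpy().reshape(-1),
                    'y': df[['asq', 'bsq']].to_numpy().reshape(-1)})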

Related

Applying function to multiple columns in pandas dataframe to make 1 new column

I'm trying to apply a function that:
takes the value of each cell in a column and divides it by the mean of its respective column,
then creates a column called ['Score'] that holds, for each row, the sum of the values computed in step 1.
My code so far:
import pandas as pd

df = pd.read_excel('fruits.xlsx')  # read_excel already returns a DataFrame

def func(column):
    out = df[column].values / df[column].mean()
    return out
I'm really unsure how to execute this properly with pandas.
Try this; it will calculate exactly what you need in a single line:
df['Score'] = df.apply(lambda x: sum([x[i] / df[i].mean() for i in df.columns]), axis=1)
You can do it like this:
df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], columns=['a', 'b', 'c'])
df['score'] = df.div(df.mean()).sum(axis=1)
Output
a b c score
0 1 2 3 1.15
1 4 5 6 3.00
2 7 8 9 4.85
You can write the output to a column in the dataframe:
df["Score"] = df[<col_name>] / df[<col_name>].mean()
and you can also use
df["Score"] = df[<col_name>].values / df[<col_name>].mean()
I tested both and both gave me the same output in the dataframe.
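For completeness, the asker's func can be combined across columns to produce the summed Score; a minimal sketch, assuming every column is numeric:
# sum() adds the per-column arrays returned by func elementwise
df['Score'] = sum(func(col) for col in df.columns)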

How to iterate over every cell in pandas Dataframe?

How to iterate over every cell of a pandas Dataframe?
Like:
for every_cell in df.iterrows():
    print(cell_value)
Printing, of course, is not the goal; the cell values of the df should be written to MongoDB.
If it has to be a for loop, you can do it like this:
def up_da_ter3(df):
    columns = df.columns.tolist()
    for _, row in df.iterrows():
        for c in columns:
            print(row[c])
        print("############")
You can use applymap. It will iterate down each column, starting with the left most. But in general you almost never need to iterate over every value of a DataFrame, pandas has much more performant ways to accomplish calculations.
import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(6).reshape(-1, 2), columns=['A', 'B'])
#    A  B
# 0  0  1
# 1  2  3
# 2  4  5

df.applymap(lambda x: print(x))
0
2
4
1
3
5
If you need it to scan across rows of the DataFrame instead, you can transpose first:
df.T.applymap(lambda x: print(x))
0
1
2
3
4
5
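Note that applymap was deprecated in pandas 2.1 in favour of the element-wise DataFrame.map, so on recent versions the same idea reads:
# pandas >= 2.1: DataFrame.map replaces applymap
df.map(lambda x: print(x))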
Another option is a double for loop inside a list comprehension:
for i in [df[j][k] for k in range(len(df)) for j in df.columns]:
    print(i)
This iterates from the first column to the last within the first row, then repeats the same process for each subsequent row.
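Since the asker's real goal is updating MongoDB rather than printing, here is a hypothetical pymongo sketch; the connection URI, database, and collection names ('mydb', 'cells') are placeholders, not from the question:
from pymongo import MongoClient
import pandas as pd

client = MongoClient('mongodb://localhost:27017')  # placeholder URI
coll = client['mydb']['cells']                     # placeholder names

df = pd.DataFrame({'A': [0, 2, 4], 'B': [1, 3, 5]})
for row_label, row in df.iterrows():
    for col_label, value in row.items():
        # one document per cell; .item() converts the numpy scalar
        # to a native Python type that BSON can encode
        coll.update_one({'row': row_label, 'col': col_label},
                        {'$set': {'value': value.item()}},
                        upsert=True)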

Apply function to two columns and map the output to a new column [duplicate]

This question already has answers here: How to apply a function to two columns of Pandas dataframe.
I am new to Pandas and would like to know how to apply a function to two columns in a dataframe, mapping the function's output to a new column. Is this possible with pandas syntax, or should I fall back to native Python and iterate over the rows in the dataframe columns to generate the new column?
a b
1 2
3 1
2 9
The question is how to get, for example, the product of the two numbers into a new column c:
a b c
1 2 2
3 1 3
2 9 18
You can do this with pandas.
For example:
def funcMul(row):
    return row['a'] * row['b']
Then:
df['c'] = df.apply(funcMul, axis=1)
Output:
a b c
0 1 2 2
1 3 1 3
2 2 9 18
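That said, for this particular example plain column arithmetic avoids apply entirely and is much faster:
# vectorized equivalent, no row-wise apply needed
df['c'] = df['a'] * df['b']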
You can do the following with pandas:
import pandas as pd

def func(r):
    # access by label; positional indexing like r[0] is deprecated on a labeled Series
    return r['a'] * r['b']

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df['c'] = df.apply(func, axis=1)
Also, here is the official documentation https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
The comment by harvpan shows the simplest way to achieve your specific example, but here is a generic way to do what you asked:
def functionUsedInApply(row):
    """The function logic for the apply step goes here.

    row: a pandas Series containing one row of df.
    """
    return row['a'] * row['b']
def functionUsedInMap(value):
    """This function is used in the map after the apply.

    For this example, if the value is larger than 5,
    return the cube; otherwise, return the square.

    value: a value of whatever type is returned by functionUsedInApply.
    """
    if value > 5:
        return value**3
    else:
        return value**2

df['new_column_name'] = df.apply(functionUsedInApply, axis=1).map(functionUsedInMap)
The function above first multiplies columns a and b, then returns the square of that product when it is at most 5 and the cube when it is greater than 5.
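As a usage sketch with the question's data (the products are 2, 3, and 18, so the first two are squared and the last is cubed):
df = pd.DataFrame({'a': [1, 3, 2], 'b': [2, 1, 9]})
df['new_column_name'] = df.apply(functionUsedInApply, axis=1).map(functionUsedInMap)
# new_column_name: 4, 9, 5832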

How to transform the result of a Pandas `GROUPBY` function to the original dataframe

Suppose I have a Pandas DataFrame with 6 columns and a custom function that takes counts of the elements in 2 or 3 columns and produces a boolean output. When a groupby object is created from the original dataframe and the custom function is applied with df.groupby('col1').apply(myfunc), the result is a series whose length equals the number of categories of col1. How do I expand this output to match the length of the original dataframe? I tried transform, but was not able to use the custom function myfunc with it.
EDIT:
Here is an example code:
A = pd.DataFrame({'X': ['a', 'b', 'c', 'a', 'c'],
                  'Y': ['at', 'bt', 'ct', 'at', 'ct'],
                  'Z': ['q', 'q', 'r', 'r', 's']})
print(A)

def myfunc(df):
    return (df['Z'].nunique() >= 2) and (df['Y'].nunique() < 2)

A.groupby('X').apply(myfunc)
I would like to expand this output into a new column Result, such that wherever column X holds a, Result will be True.
You can map the groupby back to the original dataframe
A['Result'] = A['X'].map(A.groupby('X').apply(myfunc))
Result would look like:
X Y Z Result
0 a at q True
1 b bt q False
2 c ct r True
3 a at r True
4 c ct s True
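Since the question mentions trying transform, here is a minimal sketch that gets the same Result column without the custom function, by transforming each needed column separately (my rephrasing of myfunc's logic):
g = A.groupby('X')
A['Result'] = g['Z'].transform('nunique').ge(2) & g['Y'].transform('nunique').lt(2)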
My solution may not be the best one, since it uses a loop, but I think it's pretty good.
The core idea is that you can traverse all the sub-dataframes (gdf) with for i, gdf in gp. Then add the result column (c in my example) to each sub-dataframe, and finally concat all the sub-dataframes into one.
Here is an example:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1, 2], 'b': ['a', 'b', 'c', 'd']})
gp = df.groupby('a')   # group
s = gp['a'].sum()      # apply a func: the per-group sum of column a
adf = []
# then build the new dataframe group by group
for i, gdf in gp:
    tdf = gdf.copy()
    tdf.loc[:, 'c'] = s.loc[i]
    adf.append(tdf)
pd.concat(adf)
from:
a b
0 1 a
1 2 b
2 1 c
3 2 d
to:
a b c
0 1 a 2
2 1 c 2
1 2 b 4
3 2 d 4
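For this particular example, transform gives the same result without the loop:
# broadcast each group's sum of 'a' back to that group's rows
df['c'] = df.groupby('a')['a'].transform('sum')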

Apply function to pandas DataFrame that can return multiple rows

I am trying to transform DataFrame, such that some of the rows will be replicated a given number of times. For example:
df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count':[1,0,2]})
class count
0 A 1
1 B 0
2 C 2
should be transformed to:
class
0 A
1 C
2 C
This is the reverse of aggregation with the count function. Is there an easy way to achieve it in pandas (without using for loops or list comprehensions)?
One possibility might be to allow the DataFrame.applymap function to return multiple rows (akin to the apply method of GroupBy). However, I do not think that is possible in pandas right now.
You could use groupby:
def f(group):
    row = group.iloc[0]   # the original answer used the long-removed irow(0)
    return pd.DataFrame({'class': [row['class']] * row['count']})

df.groupby('class', group_keys=False).apply(f)
so you get
In [25]: df.groupby('class', group_keys=False).apply(f)
Out[25]:
class
0 A
0 C
1 C
You can fix the index of the result however you like.
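For instance, a clean 0..n-1 index:
df.groupby('class', group_keys=False).apply(f).reset_index(drop=True)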
I know this is an old question, but I was having trouble getting Wes' answer to work for multiple columns in the dataframe, so I made his code a bit more generic. Thought I'd share in case anyone else stumbles on this question with the same problem.
You basically just specify which column holds the counts, and you get an expanded dataframe in return.
import pandas as pd

df = pd.DataFrame({'class 1': ['A', 'B', 'C', 'A'],
                   'class 2': [1, 2, 3, 1],
                   'count':   [3, 3, 3, 1]})
print(df, "\n")

def f(group, *args):
    row = group.iloc[0]   # iloc replaces the removed irow
    Dict = {}
    row_dict = row.to_dict()
    for item in row_dict:
        Dict[item] = [row[item]] * row[args[0]]
    return pd.DataFrame(Dict)

def ExpandRows(df, WeightsColumnName):
    df_expand = (df.groupby(df.columns.tolist(), group_keys=False)
                   .apply(f, WeightsColumnName)
                   .reset_index(drop=True))
    return df_expand

df_expanded = ExpandRows(df, 'count')
print(df_expanded)
Returns:
class 1 class 2 count
0 A 1 3
1 B 2 3
2 C 3 3
3 A 1 1
class 1 class 2 count
0 A 1 1
1 A 1 3
2 A 1 3
3 A 1 3
4 B 2 3
5 B 2 3
6 B 2 3
7 C 3 3
8 C 3 3
9 C 3 3
With regards to speed, my base df is 10 columns by ~6k rows, and it is ~100,000 rows once expanded; the expansion takes ~7 seconds. I'm not sure whether grouping on all the columns is necessary or wise here, but it only costs those 7 seconds.
There is an even simpler and significantly more efficient solution.
I had to make similar modification for a table of about 3.5M rows, and the previous suggested solutions were extremely slow.
A better way is to use numpy's repeat procedure for generating a new index in which each row index is repeated multiple times according to its given count, and use iloc to select rows of the original table according to this index:
import pandas as pd
import numpy as np
df = pd.DataFrame({'class': ['A', 'B', 'C'], 'count': [1, 0, 2]})
spread_ixs = np.repeat(range(len(df)), df['count'])
spread_ixs
array([0, 2, 2])
df.iloc[spread_ixs, :].drop(columns='count').reset_index(drop=True)
class
0 A
1 C
2 C
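The same idea is available directly on the index, which reads a little more cleanly:
# equivalent idiom via Index.repeat
df.loc[df.index.repeat(df['count'])].drop(columns='count').reset_index(drop=True)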
This question is very old and the answers do not reflect pandas modern capabilities. You can use iterrows to loop over every row and then use the DataFrame constructor to create new DataFrames with the correct number of rows. Finally, use pd.concat to concatenate all the rows together.
pd.concat([pd.DataFrame([row] * row['count'])   # repeat each row 'count' times
           for _, row in df.iterrows()], ignore_index=True)
class count
0 A 1
1 C 2
2 C 2
This has the benefit of working with any size DataFrame.
