I have a dataframe with 3 columns a, b, c like below:
df = pd.DataFrame({'a':[1,1,5,3], 'b':[2,0,6,1], 'c':[4,3,1,4]})
I want to add a column d that is the sum of some columns in df, but not the same columns for each row. For example,
only rows 1 and 3 sum the same columns; rows 0 and 2 use other columns.
What I found on Stack Overflow always applies the same columns to the whole dataframe, but this case is different.
What is the best way to do this?
Because column d follows no single rule, the only way is to compute each row separately. Assign through df.loc rather than chained indexing such as df['d'].iloc[0] = ..., which may write to a temporary copy:
df['d'] = 0
df.loc[0, 'd'] = df.loc[0, 'b']
df.loc[1, 'd'] = df.loc[1, 'a'] + df.loc[1, 'c']
df.loc[2, 'd'] = df.loc[2, 'a']
df.loc[3, 'd'] = df.loc[3, 'a'] + df.loc[3, 'c']
If rows 1 and 3 follow a rule:
odd = (df.index % 2) == 1
df.loc[odd, 'd'] = df.loc[odd, 'a'] + df.loc[odd, 'c']
Also, with a for loop:
for i in range(0, 4):
    if i % 2 == 1:
        df.loc[i, 'd'] = df.loc[i, 'a'] + df.loc[i, 'c']
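When the per-row rules can be written as conditions, one alternative is numpy.select, which takes the value from the first condition that matches each row. This is only a sketch and not the method used above; the three conditions are assumptions chosen to reproduce the row-by-row assignments.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 5, 3], 'b': [2, 0, 6, 1], 'c': [4, 3, 1, 4]})

# Hypothetical rules: row 0 takes b, odd rows take a + c, row 2 takes a.
conditions = [df.index == 0, df.index % 2 == 1, df.index == 2]
choices = [df['b'], df['a'] + df['c'], df['a']]

# np.select checks the conditions in order; rows matching none get the default.
df['d'] = np.select(conditions, choices, default=0)
print(df)
```

This keeps the rules in one place, but it only helps when rows can be grouped by condition. Rules that are truly different on every row still need a per-row approach.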
The dynamic way uses pd.eval(), as per [this solution][1]. It assumes a df['formula'] column that holds each row's expression as a string. Each row's formula is evaluated individually, so df['formula'] can differ on every row and nothing is hardcoded in your code. There's a lot going on in this one-liner; see the explanation in the Notes below.
df.apply(lambda row: pd.eval(row['formula'], local_dict=row.to_dict()), axis=1)
0 2
1 4
2 5
3 4
# ^--- this is the result
If you want to assign that result to a dataframe column, say df['z']:
df['z'] = df.apply(lambda row: pd.eval(row['formula'], local_dict=row.to_dict()), axis=1)
Alternatively, you could use pd.eval(..., inplace=True), but then the formula would need to contain an actual assignment, e.g. 'z=a+b', and the 'z' column would need to have been declared already: df['z'] = np.nan. That part is slightly annoying to implement, so I didn't.
NOTES:
we use pd.eval(...) to dynamically evaluate the ['formula'] column
...using the pd.eval(.., local_dict=...) argument to pass in the variables for that row
to evaluate an expression on each dataframe row, we use df.apply(..., axis=1). We have to provide some lambda function to tell it what to evaluate.
So how does pd.eval() know how to map the strings a,b,c to their values on that individual row?
When we call df.apply(..., axis=1) row-wise like that, each row gets passed in as an individual Series, so within our apply(... axis=1), we can no longer reference the dataframe as df or its columns as df['a'], df['b'], ...
So instead we need to pass in that row as a Python dict, hence the local_dict=row.to_dict() argument to pd.eval, inside the lambda function.
The pd.eval() approach can handle arbitrarily complicated formulas in the variables, not just simple sums; it can handle e.g. (a + c**2)/(b+c). You could reference external constants, or external functions e.g. log10.
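To make the notes concrete, here is a minimal end-to-end sketch. The contents of the 'formula' column are hypothetical (the original formulas are not shown), and engine='python' is passed explicitly as an assumption so the sketch does not depend on numexpr being installed.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 5, 3], 'b': [2, 0, 6, 1], 'c': [4, 3, 1, 4]})

# Hypothetical per-row formulas, stored as strings that name the other columns.
df['formula'] = ['b', 'a + c', 'a', 'a + b']

# Each row is passed in as a Series; its dict form supplies the variables a, b, c.
df['z'] = df.apply(
    lambda row: pd.eval(row['formula'], local_dict=row.to_dict(), engine='python'),
    axis=1,
)
print(df)
```

Because pd.eval is called once per row, this is convenient rather than fast. It is best suited to small frames or formulas that genuinely vary by row.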
References:
[1]: Compute dataframe columns from a string formula in variables?
I have two questions, but first I will give the context. I am trying to use a pandas DataFrame with some existing code using a functional programming approach. I basically want to map a function to every row of a DataFrame, expanding the row using the double-asterisk keyword argument notation, where each column name of the DataFrame corresponds to one of the arguments of the existing function.
For example, say I have the following function.
def line(m, x, b):
    y = (m * x) + b
    return y
And I have a pandas DataFrame
data = [{"b": 1, "m": 1, "x": 2}, {"b": 2, "m": 2, "x": 3}]
df = pd.DataFrame(data)
# Returns
# b m x
# 0 1 1 2
# 1 2 2 3
Ultimately, I want to construct a column in the DataFrame from the results of line applied to each row; something like the following.
# Note that I'm using the list of dicts defined above, not the DataFrame.
results = [line(**datum) for datum in data]
I feel like I should be able to use some combination of DataFrame.apply, a lambda, probably Series.to_dict, and the double-asterisk keyword argument expansion but I can't figure out what is passed to the lambda in the following expression.
df.apply(lambda x: x, axis=1)
# ^
# What is pandas passing to my identity lambda?
I've tried to inspect with type and x.__class__, but both of the following lines throw TypeErrors.
df.apply(lambda x: type(x), axis=1)
df.apply(lambda x: x.__class__, axis=1)
I don't want to write/refactor a new line function that can wrangle some pandas object because I shouldn't have to. Ultimately, I want to end up with a DataFrame with columns for the input data and a column with the corresponding output of the line function.
My two questions are:
How can I pass a row of a pandas DataFrame to a function using keyword-argument expansion, either using the DataFrame.apply method or some other (functional) approach?
What exactly is DataFrame.apply passing to the function that I specify?
Maybe there is some other functional approach I could take that I'm just not aware of, but I figure pandas is a pretty popular library for this kind of thing and that's why I'm trying to use it. Also there are some data (de)serialization issues I'm facing that pandas should make pretty easy vs. writing a more bespoke solution.
Thanks.
Maybe this is what you are looking for.
1)
df.apply(lambda x: line(**x.to_dict()), axis=1)
Result
0 3
1 8
2)
The function for df.apply(..., axis=1) receives a Series representing a row with the column names as index entries.
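Both points can be checked in one short script. This is a sketch using the line function and data from the question; the seen_types list and the 'y' column name are introduced here only for illustration.

```python
import pandas as pd

def line(m, x, b):
    return (m * x) + b

df = pd.DataFrame([{"b": 1, "m": 1, "x": 2}, {"b": 2, "m": 2, "x": 3}])

# Record the type of each object that apply hands to the function.
seen_types = []
df.apply(lambda row: seen_types.append(type(row).__name__), axis=1)

# Expand each row into keyword arguments; the index entries are the column names.
df["y"] = df.apply(lambda row: line(**row.to_dict()), axis=1)
print(seen_types, df["y"].tolist())
```

Collecting the types in a plain list sidesteps whatever apply tries to do with the returned values, so the inspection does not depend on how the results are assembled.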
I'm having an issue where I need a function to find a corresponding value from within a dataframe with multiple rows within them, looking similar to:
Value
0 1.2332165631653
1 6.5651324661235
2 2.3651432415454
3 1.6566584651432
4 9.5168743514354
5 ...
My function looks like this:
import math
import pandas as pd
df1 = pd.read_csv('Data1.csv')
df2 = pd.read_csv('Data2.csv')
def dfFunction(A, B):
    Step = 10
    AB = A * B
    ABInt = math.floor(AB / Step)
    dfValue = df1.iloc[ABInt]
    return AB / dfValue
When I input A and B values as int or float, the function works, but when I try to apply the function to df2 (similar to df1 in terms of layout, just additional columns of floats), I'm returning this error.
I've tried df2.apply(dfFunction(df2.ColumnA, df2.ColumnB), axis = 1) and simply dfFunction(df2.ColumnA, df2.ColumnB).
I fundamentally understand the error, since it highlights the math.floor() line, but I can't look up a row of df1 with a float index. Is there another way to write the function or to look up the data value? I'd just use iloc() if the floats didn't have so many decimal places, but the values are means from another portion of the code.
Please let me know if further clarification is needed; I'm only a beginner with Python and Stack :)
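One possible direction, sketched under assumptions: math.floor only accepts a single number, while numpy.floor works element-wise on a whole column. The frames below are made-up stand-ins for Data1.csv and Data2.csv, df_function is a hypothetical rewrite of dfFunction, and the column names Value, ColumnA and ColumnB are taken from the question.

```python
import numpy as np
import pandas as pd

# Stand-ins for Data1.csv and Data2.csv; the numbers are made up for illustration.
df1 = pd.DataFrame({"Value": [1.5, 2.0, 4.0, 8.0]})
df2 = pd.DataFrame({"ColumnA": [2.0, 3.5, 6.0], "ColumnB": [4.0, 5.0, 5.0]})

def df_function(a, b, step=10):
    ab = a * b
    # np.floor works element-wise on a Series, unlike math.floor.
    positions = np.floor(ab / step).astype(int)
    # Look up df1 by integer position and work on plain arrays, so the
    # division lines up with ab by position rather than by index label.
    looked_up = df1["Value"].to_numpy()[positions.to_numpy()]
    return ab / looked_up

df2["Result"] = df_function(df2["ColumnA"], df2["ColumnB"])
print(df2)
```

Note that a position beyond the last row of df1 raises an IndexError here, so real data may need the positions clipped or validated first.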
I've consulted a bunch of previous related SO posts, but I could not adapt them to solve my question.
Here is an example dataframe.
# Using pandas 0.24.2
data = {'customer_id': [1, 2, 3],
'prev_due_date':['Jun-2010', 'Apr-2019', 'Dec-1999'],
'current_due_date':['Aug-2019', 'Dec-2045', 'Jan-2000'],
'next_due_date':['Feb-2025', 'Nov-2065', 'Sep-2001']
}
df = pd.DataFrame(data)
Here is what the dataframe looks like. There are many more such columns to parse in the actual dataframe, hence my question.
customer_id prev_due_date current_due_date next_due_date
0 1 Jun-2010 Aug-2019 Feb-2025
1 2 Apr-2019 Dec-2045 Nov-2065
2 3 Dec-1999 Jan-2000 Sep-2001
I have created a function to parse one column (ie, this adds two parsed columns --- month and year columns --- to the supplied df)
def parse_column(df, col_parse):
    col_parse_mmm = col_parse + '_mmm'
    col_parse_yyyy = col_parse + '_yyyy'
    df[[col_parse_mmm, col_parse_yyyy]] = df[col_parse].str.split('-', expand=True)
    return df
Calling this function below does the job for the supplied column:
parse_column(df, 'prev_due_date')
Now, my question is:
How can I do this for an arbitrary number of columns of my choosing (e.g., a list of tens or hundreds of columns that I want to parse), using apply?
Is it possible to avoid using apply?
for c in df.columns:
    if c.endswith('_date'):
        parse_column(df, c)
(You don't need to return the df in your parse_column function.)
If you already have the list with the column names you're interested in:
for c in my_columns_list:
    parse_column(df, c)
You don't need any apply.
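Here is a self-contained sketch of the loop on a cut-down version of the example data. The parse_column below is a simplified variant of the one in the question (it returns nothing and assigns the two columns separately), so treat it as an illustration rather than the original function.

```python
import pandas as pd

df = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'prev_due_date': ['Jun-2010', 'Apr-2019', 'Dec-1999'],
    'current_due_date': ['Aug-2019', 'Dec-2045', 'Jan-2000'],
})

def parse_column(df, col_parse):
    # Split 'Mon-YYYY' strings into two new columns, modifying df in place.
    parts = df[col_parse].str.split('-', expand=True)
    df[col_parse + '_mmm'] = parts[0]
    df[col_parse + '_yyyy'] = parts[1]

# Snapshot the names first so the loop does not see the columns it adds.
for c in list(df.columns):
    if c.endswith('_date'):
        parse_column(df, c)

print(df.columns.tolist())
```

The year columns come out as strings; convert them with astype(int) if numeric years are needed.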
I have a dataframe with headers 'Category', 'Factor1', 'Factor2', 'Factor3', 'Factor4', 'UseFactorA', 'UseFactorB'.
The value of 'UseFactorA' and 'UseFactorB' are one of the strings ['Factor1', 'Factor2', 'Factor3', 'Factor4'], keyed based on the value in 'Category'.
I want to generate a column, 'Result', which equals dataframe[UseFactorA]/dataframe[UseFactorB]
Take the below dataframe as an example:
Category  Factor1  Factor2  Factor3  Factor4  UseFactorA  UseFactorB
A         1        2        5        8        'Factor1'   'Factor3'
B         2        7        4        2        'Factor3'   'Factor1'
The 'Result' series should be [0.2, 2]
However, I cannot figure out how to feed the values of UseFactorA and UseFactorB into an index to make this happen. If the columns to use were fixed, I would just write
df['Result'] = df['Factor1']/df['Factor2']
However, when I try to give
df['Result'] = df[df['UseFactorA']]/df[df['UseFactorB']]
I get the error
ValueError: Wrong number of items passed 3842, placement implies 1
Is there a method for doing what I am trying here?
Probably not the prettiest solution (because of the iterrows), but what comes to mind is to iterate through the sets of factors and set the 'Result' value at each index:
for i, factors in df[['UseFactorA', 'UseFactorB']].iterrows():
    df.loc[i, 'Result'] = df.loc[i, factors['UseFactorA']] / df.loc[i, factors['UseFactorB']]
Edit:
Another option:
def factor_calc_for_row(row):
    factorA = row['UseFactorA']
    factorB = row['UseFactorB']
    return row[factorA] / row[factorB]
df['Result'] = df.apply(factor_calc_for_row, axis=1)
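For reference, a runnable version of the apply option on the two-row example from the question. The column names follow the question's text (UseFactorA, UseFactorB), and the function body is condensed to one line.

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'B'],
    'Factor1': [1, 2], 'Factor2': [2, 7], 'Factor3': [5, 4], 'Factor4': [8, 2],
    'UseFactorA': ['Factor1', 'Factor3'],
    'UseFactorB': ['Factor3', 'Factor1'],
})

def factor_calc_for_row(row):
    # row is a Series, so the factor name stored in the row can index the row itself.
    return row[row['UseFactorA']] / row[row['UseFactorB']]

df['Result'] = df.apply(factor_calc_for_row, axis=1)
print(df['Result'].tolist())
```

Row A divides Factor1 by Factor3 (1 / 5) and row B divides Factor3 by Factor1 (4 / 2).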
Here's the one liner:
df['Results'] = [df[df['UseFactorA'][x]][x]/df[df['UseFactorB'][x]][x] for x in range(len(df))]
How it works is:
df[df['UseFactorA']]
Returns a data frame,
df[df['UseFactorA'][x]]
Returns a Series
df[df['UseFactorA'][x]][x]
Pulls a single value from the series.
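If the frame is large, the same lookup can probably be done without a Python-level loop by using NumPy integer indexing. This is a sketch of a different technique from the one-liner above; factor_cols, a_pos and b_pos are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'B'],
    'Factor1': [1, 2], 'Factor2': [2, 7], 'Factor3': [5, 4], 'Factor4': [8, 2],
    'UseFactorA': ['Factor1', 'Factor3'],
    'UseFactorB': ['Factor3', 'Factor1'],
})

factor_cols = ['Factor1', 'Factor2', 'Factor3', 'Factor4']
values = df[factor_cols].to_numpy()
rows = np.arange(len(df))

# Translate each factor name into its column position within factor_cols.
a_pos = pd.Index(factor_cols).get_indexer(df['UseFactorA'])
b_pos = pd.Index(factor_cols).get_indexer(df['UseFactorB'])

# Integer-array indexing picks one value per row: values[i, pos[i]].
df['Result'] = values[rows, a_pos] / values[rows, b_pos]
print(df['Result'].tolist())
```

Unlike the one-liner, this does not rely on the frame having a default 0..n-1 index. One caveat: get_indexer returns -1 for a name it cannot find, which would silently select the last factor column, so unknown names should be checked first.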
To pass multiple variables to a normal python function you can just write something like:
from datetime import timedelta
def a_function(date, string, number):
    # convert the string to an int, then move the date by (number * int) days
    days = number * int(string)
    return date + timedelta(days=days)
When using Pandas DataFrames I know you can create a new column based on the contents of one like so:
df['new_col'] = df['column_A'].map(a_function)
# This might return the year from a date column
# return date.year
What I'm wondering is in the same way you can pass multiple pieces of data to a single function (as seen in the first example above), can you use multiple columns in the creation of a new pandas DataFrame column?
For example combining three separate parts of a date Y - M - D into one field.
df['whole_date'] = df['Year','Month','Day'].map(a_function)
I get a key error with the following test.
def combine(one,two,three):
    return one + two + three
df = pd.DataFrame({'a': [1,2,3], 'b': [2,3,4],'c': [4,5,6]})
df['d'] = df['a','b','b'].map(combine)
Is there a way of creating a new column in a pandas DataFrame using .map or something else which takes as input three columns and returns a single column?
-> Example input: 1, 2, 3
-> Example output: 1*2*3
Likewise is there also a way of having a function take in one argument, a date and return three new pandas DataFrame columns; one for the year, month and day?
Is there a way of creating a new column in a pandas dataframe using .MAP or something else which takes as input three columns and returns a single column. For example input would be 1, 2, 3 and output would be 1*2*3
To do that, you can use apply with axis=1. However, instead of being called with three separate arguments (one for each column) your specified function will then be called with a single argument for each row, and that argument will be a Series containing the data for that row. You can either account for this in your function:
def combine(row):
    return row['a'] + row['b'] + row['c']
>>> df.apply(combine, axis=1)
0 7
1 10
2 13
Or you can pass a lambda which unpacks the Series into separate arguments:
def combine(one,two,three):
    return one + two + three
>>> df.apply(lambda x: combine(*x), axis=1)
0 7
1 10
2 13
If you want to pass only specific columns, you need to select them by indexing on the DataFrame with a list:
>>> df[['a', 'b', 'c']].apply(lambda x: combine(*x), axis=1)
0 7
1 10
2 13
Note the double brackets. (This doesn't really have anything to do with apply; indexing with a list is the normal way to access multiple columns from a DataFrame.)
However, it's important to note that in many cases you don't need to use apply, because you can just use vectorized operations on the columns themselves. The combine function above can simply be called with the DataFrame columns themselves as the arguments:
>>> combine(df.a, df.b, df.c)
0 7
1 10
2 13
This is typically much more efficient when the "combining" operation is vectorizable.
Likewise is there also a way of having a function take in one argument, a date and return three new pandas dataframe columns; one for the year, month and day?
As above, there are two basic ways to do this: a general but non-vectorized way using apply, and a faster vectorized way. Suppose you have a DataFrame like this:
>>> df = pandas.DataFrame({'date': pandas.date_range('2015/05/01', '2015/05/03')})
>>> df
date
0 2015-05-01
1 2015-05-02
2 2015-05-03
You can define a function that returns a Series for each value, and then apply it to the column:
def dateComponents(date):
    return pandas.Series([date.year, date.month, date.day], index=["Year", "Month", "Day"])
>>> df.date.apply(dateComponents)
   Year  Month  Day
0 2015 5 1
1 2015 5 2
2 2015 5 3
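With apply, this works for any function of a single value. For datetime columns specifically, the components are also available in vectorized form through the .dt accessor (df.date.dt.year, df.date.dt.month, df.date.dt.day). String columns have similar vectorized operations through .str: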
>>> df = pandas.DataFrame({'a': ["Hello", "There", "Pal"]})
>>> df
a
0 Hello
1 There
2 Pal
>>> pandas.DataFrame({'FirstChar': df.a.str[0], 'Length': df.a.str.len()})
FirstChar Length
0 H 5
1 T 5
2 P 3
Here again the operation is vectorized by operating directly on the values instead of applying a function elementwise. In this case, we have two vectorized operations (getting first character and getting the string length), and then we wrap the results in another call to DataFrame to create separate columns for each of the two kinds of results.
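For the date example specifically, the year, month and day can also be pulled out in vectorized form with the .dt accessor on a datetime column. A short sketch on the same three dates, where the parts name is introduced only for illustration:

```python
import pandas as pd

df = pd.DataFrame({'date': pd.date_range('2015/05/01', '2015/05/03')})

# Each .dt attribute returns a whole column at once, with no per-row function call.
parts = pd.DataFrame({
    'Year': df['date'].dt.year,
    'Month': df['date'].dt.month,
    'Day': df['date'].dt.day,
})
print(parts)
```

This follows the same pattern as the string example: several vectorized results wrapped in one DataFrame call.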
I normally use apply for this kind of thing; it's basically the DataFrame version of map (the axis parameter lets you decide whether to apply your function to rows or columns):
df.apply(lambda row: row.a*row.b*row.c, axis =1)
or
df.apply(np.prod, axis=1)
0 8
1 30
2 72
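For a plain product like this one, apply can also be skipped entirely. A minimal sketch on the a, b, c frame from the question, comparing column arithmetic with DataFrame.prod; by_columns and by_prod are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 4], 'c': [4, 5, 6]})

# Column arithmetic multiplies whole columns at once.
by_columns = df['a'] * df['b'] * df['c']

# DataFrame.prod(axis=1) reduces across each row.
by_prod = df.prod(axis=1)

print(by_columns.tolist(), by_prod.tolist())
```

Both give the same values as the apply calls above and usually scale better on large frames.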