Creating a DataFrame column using apply() in Python

Say you have a table T with columns A and B containing numerical values. I want to create a new column C that gives me the ratio A/B. I know the easy way to do this:
T['C']=T['A']/T['B']
But I want to try using the apply() function on a new copy of table T. I have the following function to do this for any table.
def ratio(T):
    X = T.copy()
    def ratio(a, b):
        return a/b
    X['C'] = X['C'].apply(ratio, 'A', 'B')
    return X
I get a KeyError: 'C'. How do I properly get 'C' to exist in order to apply the function?

You could simplify this with a lambda:
X = T.copy()
X['C'] = X.apply(lambda row: row.A/row.B, axis=1)
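If you want to keep the shape of your original function, here is a minimal sketch of a fixed version (assuming T is a DataFrame with columns A and B). The key point is that apply is called on the DataFrame with axis=1, not on the not-yet-existing column 'C':

def ratio(T):
    X = T.copy()
    # apply() over rows (axis=1) computes A/B before 'C' exists,
    # and the result is assigned to the new column in one step
    X['C'] = X.apply(lambda row: row['A'] / row['B'], axis=1)
    return X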

Related

What is `pandas.DataFrame.apply` actually operating on?

I have two questions, but first I will give the context. I am trying to use a pandas DataFrame with some existing code using a functional programming approach. I basically want to map a function to every row of a DataFrame, expanding the row using the double-asterisk keyword argument notation, where each column name of the DataFrame corresponds to one of the arguments of the existing function.
For example, say I have the following function.
def line(m, x, b):
    y = (m * x) + b
    return y
And I have a pandas DataFrame
data = [{"b": 1, "m": 1, "x": 2}, {"b": 2, "m": 2, "x": 3}]
df = pd.DataFrame(data)
# Returns
#    b  m  x
# 0  1  1  2
# 1  2  2  3
Ultimately, I want to construct a column in the DataFrame from the results of line applied to each row; something like the following.
# Note that I'm using the list of dicts defined above, not the DataFrame.
results = [line(**datum) for datum in data]
I feel like I should be able to use some combination of DataFrame.apply, a lambda, probably Series.to_dict, and the double-asterisk keyword argument expansion but I can't figure out what is passed to the lambda in the following expression.
df.apply(lambda x: x, axis=1)
# ^
# What is pandas passing to my identity lambda?
I've tried to inspect with type and x.__class__, but both of the following lines throw TypeErrors.
df.apply(lambda x: type(x), axis=1)
df.apply(lambda x: x.__class__, axis=1)
I don't want to write/refactor a new line function that can wrangle some pandas object because I shouldn't have to. Ultimately, I want to end up with a DataFrame with columns for the input data and a column with the corresponding output of the line function.
My two questions are:
How can I pass a row of a pandas DataFrame to a function using keyword-argument expansion, either using the DataFrame.apply method or some other (functional) approach?
What exactly is DataFrame.apply passing to the function that I specify?
Maybe there is some other functional approach I could take that I'm just not aware of, but I figure pandas is a pretty popular library for this kind of thing and that's why I'm trying to use it. Also there are some data (de)serialization issues I'm facing that pandas should make pretty easy vs. writing a more bespoke solution.
Thanks.
Maybe this is what you are looking for.
1)
df.apply(lambda x: line(**x.to_dict()), axis=1)
Result
0    3
1    8
2)
The function for df.apply(..., axis=1) receives a Series representing a row with the column names as index entries.
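Putting 1) and 2) together, a minimal self-contained sketch of the whole workflow from the question, building a result column from keyword-argument expansion of each row (the column name y is my choice, not from the question):

import pandas as pd

def line(m, x, b):
    return (m * x) + b

data = [{"b": 1, "m": 1, "x": 2}, {"b": 2, "m": 2, "x": 3}]
df = pd.DataFrame(data)

# Each row arrives in the lambda as a Series; to_dict() turns it into
# {"b": 1, "m": 1, "x": 2}, which ** expands into line's keyword arguments
df["y"] = df.apply(lambda row: line(**row.to_dict()), axis=1)
print(df)  # columns b, m, x plus the new result column y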

Select all rows in Python pandas

I have a function that prints the sum along a column of a pandas DataFrame after filtering on some rows to be defined, and the percentage this quantity makes up of the same sum without any filter:
import numpy as np

def my_function(df, filter_to_apply, col):
    my_sum = np.sum(df[filter_to_apply][col])
    print(my_sum)
    print(my_sum / np.sum(df[col]))
Now I am wondering if there is any way to have a filter_to_apply that actually doesn't do any filter (i.e. keeps all rows), to keep using my function (that is actually a bit more complex and convenient) even when I don't want any filter.
So, some filter_f1 such that df[filter_f1] equals df, and that could be combined with other filters: filter_f1 & filter_f2.
One possible answer is: df.index.isin(df.index) but I am wondering if there is anything easier to understand (e.g. I tried to use just True but it didn't work).
A Python slice object selects all indexes in an indexable object when given no bounds: slice(None) is the object form of the bare colon :, so df[slice(None)] selects all rows in the DataFrame. (Note that slice(-1) would not work here; its single argument is the stop value, so it drops the last row.) You can store that in a variable as an initial value which you can further refine in your logic:
filter_to_apply = slice(None)  # initialize to select all rows
... # logic that may set `filter_to_apply` to something more restrictive
my_function(df, filter_to_apply, col)
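One caveat: a slice cannot be combined with other filters via &, which the question asks for. A hedged alternative is an all-True boolean mask aligned to the DataFrame's index; it behaves like "no filter" and does compose. The example data here is hypothetical:

import pandas as pd

df = pd.DataFrame({"a": [1, -2, 3]})  # hypothetical example data

# An all-True boolean mask selects every row and, unlike a slice,
# composes with other boolean filters via &
filter_f1 = pd.Series(True, index=df.index)

print(df[filter_f1].equals(df))       # True: behaves like "no filter"
print(df[filter_f1 & (df["a"] > 0)])  # same as filtering on df["a"] > 0 alone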
This is a way to select all rows:
df.iloc[range(len(df))]
(Note that plain df[range(0, len(df))] would try to select columns, not rows, so .iloc is needed.) So is this:
df[:]
But I haven't figured out a way to pass : itself as an argument (as noted above, slice(None) is the object form of :).
pandas DataFrames have a .loc indexer that filters rows. You could do something like this:
df2 = df.loc[<Filter here>]
# Filter can be something like df['price'] > 500 or df['name'] == 'Brian',
# basically something that returns a boolean for each row
total = df2['ColumnToSum'].sum()
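As a small usage sketch reusing the hypothetical price filter from the comment above, the row filter and the column selection can also be combined in a single .loc call:

# Row filter and column selection combined in one .loc call
total = df.loc[df['price'] > 500, 'ColumnToSum'].sum()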

In Python, rename variables using a parameter of a function

I am creating a function. One input of this function will be a pandas DataFrame, and one of its tasks is to do some operation with two variables (columns) of this DataFrame. These two variables are not fixed, and I want the freedom to determine them using parameters passed as inputs to the function fun.
For example, suppose at some moment the variables I want to use are 'var1' and 'var2' (but at another time, I may want to use two other variables). Suppose that these variables take values 1, 2, 3, 4 and I want to reduce df by requiring var1 == 1 and var2 == 1. My function is like this:
def fun(df, var=['input_var1', 'input_var2'], val=1):
    df = df.rename(columns={var[1]: 'aux_var1', var[2]: 'aux_var2'})
    # Other operations
    df = df.loc[(df.aux_var1 == val) & (df.aux_var2 == val)]
    # end of operations
    # recover
    df = df.rename(columns={'aux_var1': var[1], 'aux_var2': var[2]})
    return df
When I use the function fun, I get the error:
fun(df, var = ['var1','var2'], val = 1)
IndexError: list index out of range
Actually, I want to do other more complex operations and I didn't describe these operations so as not to extend the question. Perhaps the simple example above has a solution that does not need to rename the variables. But maybe this solution doesn't work with the operations I really want to do. So first, I would necessarily like to correct the error when renaming the variables. If you want to give another more elegant solution that doesn't need renaming, I appreciate that too, but I will be very grateful if besides the elegant solution, you offer me the solution about renaming.
Python lists are zero-indexed, i.e. the first element has index 0.
Just change the lines:
df = df.rename(columns={var[1]: 'aux_var1', var[2]: 'aux_var2'})
df = df.rename(columns={'aux_var1': var[1], 'aux_var2': var[2]})
to
df = df.rename(columns={var[0]: 'aux_var1', var[1]: 'aux_var2'})
df = df.rename(columns={'aux_var1': var[0], 'aux_var2': var[1]})
respectively.
In this case you are accessing var[2], but a 2-element list in Python has indices 0 and 1. Index 2 does not exist, so accessing it is out of range.
As has been mentioned in other answers, the error you are receiving is due to the zero-indexing of Python lists: if you wish to access the first element of the list var, you do so with index 0, not index 1, i.e. var[0].
However, on the topic of renaming: you can perform the filtering of a pandas DataFrame without any column renaming. I can see that you are accessing the column as an attribute of the DataFrame, but you can achieve the same via the __getitem__ method, which is more commonly used with square brackets, e.g. df[var[0]].
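For the two-column case in the question, a minimal sketch of that bracket-based filtering (no renaming needed) would be:

# var = ['var1', 'var2'], val = 1, as in the question; filter both columns by name
df_filtered = df[(df[var[0]] == val) & (df[var[1]] == val)]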
If you wish to have more generality in your function without any renaming, I can suggest this:
from functools import reduce
import pandas as pd

def fun(df, var, val):
    # Start from an all-True mask aligned to df's index, then AND in
    # one equality condition per column name in var
    _sub = reduce(
        lambda x, y: x & (df[y] == val),
        var,
        pd.Series(True, index=df.index)
    )
    return df[_sub]
This will work with any number of input column variables. I hope this serves as inspiration for the more complicated operations you intend to do.
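A hedged usage sketch, with hypothetical data and assuming the fun defined above is in scope:

import pandas as pd

df = pd.DataFrame({"var1": [1, 1, 2], "var2": [1, 3, 1]})
print(fun(df, ["var1", "var2"], 1))  # keeps only the first row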

How to loop through a pandas DataFrame and conditionally assign values to rows of a column?

I'm trying to loop through the 'vol' dataframe, and conditionally check if the sample_date is between certain dates. If it is, assign a value to another column.
Here's the code I have:
vol = pd.DataFrame(data=pd.date_range(start='11/3/2015', end='1/29/2019'))
vol.columns = ['sample_date']
vol['hydraulic_vol'] = np.nan

for i in vol.iterrows():
    if pd.Timestamp('2015-11-03') <= vol.loc[i, 'sample_date'] <= pd.Timestamp('2018-06-07'):
        vol.loc[i, 'hydraulic_vol'] = 319779
Here's the error I received:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
This is how you would do it properly:
cond = (pd.Timestamp('2015-11-03') <= vol.sample_date) & \
       (vol.sample_date <= pd.Timestamp('2018-06-07'))
vol.loc[cond, 'hydraulic_vol'] = 319779
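For reference, the TypeError occurred because iterrows() yields (index, row) tuples, so i was a tuple containing a Series, which pandas then tried to hash as a label. If you really must loop, a minimal sketch of the corrected loop (still far slower than the vectorized form above) is:

# Unpack the (index, Series) tuple that iterrows() yields
for i, row in vol.iterrows():
    if pd.Timestamp('2015-11-03') <= row['sample_date'] <= pd.Timestamp('2018-06-07'):
        vol.loc[i, 'hydraulic_vol'] = 319779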
Another way to do this would be to use the np.where method from the numpy module, in combination with the .between method.
This method works like this:
np.where(condition, value if true, value if false)
Code example
cond = vol.sample_date.between('2015-11-03', '2018-06-07')
vol['hydraulic_vol'] = np.where(cond, 319779, np.nan)
Or you can combine them in one single line of code:
vol['hydraulic_vol'] = np.where(vol.sample_date.between('2015-11-03', '2018-06-07'), 319779, np.nan)
Edit
I see that you're new here, so here's something I had to learn as well when coming to Python/pandas.
Looping over a DataFrame should be your last resort; try to use vectorized solutions instead. In this case that means .loc or np.where, which will perform far better in terms of speed than looping.
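For completeness, here is a minimal self-contained sketch of the vectorized solution, including the imports the snippets above assume:

import numpy as np
import pandas as pd

vol = pd.DataFrame({'sample_date': pd.date_range(start='11/3/2015', end='1/29/2019')})
# Rows inside the date range get 319779; everything else stays NaN
vol['hydraulic_vol'] = np.where(
    vol['sample_date'].between('2015-11-03', '2018-06-07'),
    319779,
    np.nan,
)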

What is the correct way to sum different DataFrame columns in a list in PySpark?

I want to sum different columns in a Spark DataFrame.
Code
from pyspark.sql import functions as F
cols = ["A.p1", "B.p1"]
df = spark.createDataFrame([[1, 2], [4, 89], [12, 60]], schema=cols)

# 1. Works
df = df.withColumn('sum1', sum([df[col] for col in ["`A.p1`", "`B.p1`"]]))

# 2. Doesn't work
df = df.withColumn('sum1', F.sum([df[col] for col in ["`A.p1`", "`B.p1`"]]))

# 3. Doesn't work
df = df.withColumn('sum1', sum(df.select(["`A.p1`", "`B.p1`"])))
Why don't approaches #2 and #3 work?
I am on Spark 2.2.
Because:
# 1. Works
df = df.withColumn('sum1', sum([df[col] for col in ["`A.p1`","`B.p1`"]]))
Here you are using the Python built-in sum function, which takes an iterable as input, so it works. https://docs.python.org/2/library/functions.html#sum
# 2. Doesn't work
df = df.withColumn('sum1', F.sum([df[col] for col in ["`A.p1`", "`B.p1`"]]))
Here you are using the PySpark sum function, which takes a single column as input and aggregates over its rows, but you are trying to use it at row level with a list of columns.
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.sum
# 3. Doesn't work
df = df.withColumn('sum1', sum(df.select(["`A.p1`", "`B.p1`"])))
Here, df.select() returns a DataFrame, and the built-in sum cannot iterate over a Spark DataFrame. In this case, I think, you would have to iterate row-wise and apply sum over that.
TL;DR builtins.sum is just fine.
Following your comments:
Using native Python sum() is not benefiting from Spark optimization, so what's the Spark way of doing it?
and
it's not a PySpark function so it won't really be fully benefiting from Spark, right?
I can see you are making incorrect assumptions.
Let's decompose the problem:
[df[col] for col in ["`A.p1`","`B.p1`"]]
creates a list of Columns:
[Column<b'A.p1'>, Column<b'B.p1'>]
Let's call it iterable.
sum reduces the list by taking its elements and calling the __add__ method (+). The imperative equivalent is:
accum = iterable[0]
for element in iterable[1:]:
    accum = accum + element
This gives Column:
Column<b'(A.p1 + B.p1)'>
which is the same as calling
df["`A.p1`"] + df["`B.p1`"]
No data has been touched, and when evaluated the result benefits from all Spark optimizations.
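A minimal sketch making that fold explicit, assuming the same df and backticked column names as in the question:

from functools import reduce
from operator import add

# Equivalent to builtins.sum over the list of Columns:
# fold them together with Column.__add__
df = df.withColumn('sum1', reduce(add, [df[c] for c in ["`A.p1`", "`B.p1`"]]))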
Addition of multiple columns from a list into one column
I tried a lot of methods and the following are my observations:
PySpark's sum function doesn't support column addition (PySpark version 2.3.1).
Python's built-in sum function works for some folks but gives an error for others (possibly because of a name conflict, e.g. a star import of pyspark.sql.functions shadowing the built-in sum).
In your 3rd approach, the expression inside Python's sum function is returning a PySpark DataFrame.
So, the addition of multiple columns can be achieved using the expr function in PySpark, which takes an expression to be computed as an input.
from pyspark.sql.functions import expr
cols_list = ['a', 'b', 'c']
# Creating an addition expression using `join`
expression = '+'.join(cols_list)
df = df.withColumn('sum_cols', expr(expression))
This gives us the desired sum of columns. We can also use any other complex expression to get other output.
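One caveat, stated as an assumption about the question's dotted column names: the expression string passed to expr needs backticks around names like A.p1, or Spark will parse the dot as struct-field access. A sketch:

from pyspark.sql.functions import expr

cols_list = ["A.p1", "B.p1"]
# Backtick-quote each dotted name so Spark doesn't parse the dot as struct access
expression = ' + '.join('`{}`'.format(c) for c in cols_list)
df = df.withColumn('sum_cols', expr(expression))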
