I have a quite specific question about how the ".loc" indexer works on the backend when applied directly to a DataFrame (e.g. df.loc[]) as opposed to being used inside a defined function that is then applied with "df.apply()".
Here is the MultiIndex dataframe structure I am working with.
[My DataFrame 1]
# Sample function
def sample(df):
    for i in df:
        val = df.loc['deep_impressions'] > 0
    return val.sum()

df.apply(sample, axis=1)
The above code uses .loc without any row/column indication, simply passing the outer column label, and when applied to the DataFrame it returns the correct output: the sum of the two columns under the "deep_impressions" outer column index.
However, when applying the same logic without a defined function, I must explicitly state that all rows, and only the "deep_impressions" columns, are to be used.
df.loc[:,'deep_impressions'] > 0
df.sum(axis=1)
df
Why doesn't Python require me to explicitly write .loc[:,"deep_impressions"] when .loc is used inside a defined function? How does it work on the backend?
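For reference, a small self-contained sketch (the real DataFrame isn't shown, so a made-up MultiIndex frame stands in) of what apply(axis=1) passes to the function: each row arrives as a Series whose index is the frame's column MultiIndex, so .loc with only the outer label selects along that Series rather than along rows.
import pandas as pd

cols = pd.MultiIndex.from_product([['deep_impressions', 'other'], ['a', 'b']])
df = pd.DataFrame([[1, 0, 5, 5], [2, 3, 5, 5]], columns=cols)

def sample(row):
    # row is a Series indexed by df.columns, so the outer label is enough
    return (row.loc['deep_impressions'] > 0).sum()

print(df.apply(sample, axis=1))
print((df.loc[:, 'deep_impressions'] > 0).sum(axis=1))   # same result, no apply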
Related
Consider a function (apply_this_function) that will be applied to a DataFrame:
# our dataset
data = {"addresses": ['Newport Beach, California', 'New York City', 'London, England', 10001, 'Sydney, Au']}
# create a throw-away dataframe
df_throwaway = df.copy()
def apply_this_function(passed_row):
    passed_row['new_col'] = True
    passed_row['added'] = datetime.datetime.now()
    return passed_row

df_throwaway.apply(apply_this_function, axis=1)  # axis=1 is important to use the row itself
In df_throwaway.apply(...), where does the function get the "passed_row" parameter from? Or what value is this function taking? My assumption is that, by the structure of apply(), the function takes values from row i starting at 1?
I am referring to the information obtained here
When you apply a function to a DataFrame with axis=1, then
this function is called for each row from the source DataFrame
and by convention its parameter is called row.
In your case this function returns (from each call) the original
row (actually a Series object), with 2 new elements added.
Then apply method collects these rows, concatenates them
and the result is a DataFrame with 2 new columns.
You wrote takes values from row i starting at 1. I would change it to
takes values from each row.
Writing starting at 1 can lead to misunderstandings, since when your
DataFrame has a default index, its values start from 0 (not from 1).
In addition, I would like to propose 2 corrections to your code:
Create your DataFrame passing data (your code sample does not
contain creation of df):
df_throwaway = pd.DataFrame(data)
Define your function as:
def apply_this_function(row):
    row['new_col'] = True
    row['added'] = pd.Timestamp.now()
    return row
i.e.:
name the parameter just row (everybody knows that this row
has been passed by the apply method),
instead of datetime.datetime.now() use pd.Timestamp.now(),
i.e. a native pandasonic type and its method.
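Putting both corrections together, a runnable sketch (the printed frame is just the original column plus the two added ones):
import pandas as pd

data = {"addresses": ['Newport Beach, California', 'New York City',
                      'London, England', 10001, 'Sydney, Au']}
df_throwaway = pd.DataFrame(data)

def apply_this_function(row):
    row['new_col'] = True
    row['added'] = pd.Timestamp.now()
    return row

df_result = df_throwaway.apply(apply_this_function, axis=1)
print(df_result)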
We have a project with multiple *.py scripts containing functions that receive and return pandas DataFrame variable(s) as arguments.
But this makes me wonder: what is the in-memory behavior of a DataFrame variable when it is passed as an argument to, or returned from, those functions?
Does modifying the df variable alter the parent/main/global variable as well?
Consider the following example:
import pandas as pd

def add_Col(df):
    df["New Column"] = 10 * 3

def mod_Col(df):
    df["Existing Column"] = df["Existing Column"] ** 2

data = [0, 1, 2, 3]
df = pd.DataFrame(data, columns=["Existing Column"])
add_Col(df)
mod_Col(df)
df
When df is displayed at the end: will the new column show up? What about the change made to "Existing Column" when calling mod_Col?
Did invoking the add_Col function create a copy of df, or only a pointer?
What is the best practice when passing DataFrames into functions? If they are large enough, I am sure creating copies will have both performance and memory implications, right?
It depends. DataFrames are mutable objects, so like lists, they can be modified within a function, without needing to return the object.
On the other hand, the vast majority of pandas operations will return a new object so modifications would not change the underlying DataFrame. For instance, below you can see that changing values with .loc will modify the original, but if you were to multiply the entire DataFrame (which returns a new object) the original remains unchanged.
If you had a function that combines both types of operations, you could modify your DataFrame up to the point where you return a new object.
Changes the original
df = pd.DataFrame([1,2,4])

def mutate_data(df):
    df.loc[1,0] = 7

mutate_data(df)
print(df)
#   0
#0  1
#1  7
#2  4
Will not change original
df = pd.DataFrame([1,2,4])

def mutate_data(df):
    df = df*2

mutate_data(df)
print(df)
#   0
#0  1
#1  2
#2  4
What should you do?
If the purpose of a function is to modify a DataFrame, like in a pipeline, then you should create a function that takes a DataFrame and returns the DataFrame.
def add_column(df):
    df['new_column'] = 7
    return df
df = add_column(df)
#┃ ┃
#┗ on lhs & rhs ┛
In this scenario it doesn't matter if the function changes or creates a new object, because we intend to modify the original anyway.
However, that may have unintended consequences if you plan to write to a new object:
df1 = add_column(df)
# |              |
# New Obj        Function still modifies this though!
A safe alternative, which requires no knowledge of the underlying source code, is to have your function copy the DataFrame at the top. Then, within that scope, changes to df do not impact the original df outside of the function.
def add_column_maintain_original(df):
    df = df.copy()
    df['new_column'] = 7
    return df
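A quick usage check of that version (same function as above):
df = pd.DataFrame([1, 2, 4])
df1 = add_column_maintain_original(df)
print('new_column' in df.columns)    # False - the original is untouched
print('new_column' in df1.columns)   # True  - only the returned copy has it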
Another possibility is to pass a copy to the function:
df1 = add_column(df.copy())
Yes, the function will indeed change the DataFrame itself without creating a copy of it. You should be careful with this, because you might end up having columns changed without noticing.
In my opinion the best practice depends on the use case, and using .copy() will indeed have an impact on your memory.
If, for instance, you are creating a pipeline with some DataFrame as input, you do not want to change the input DataFrame itself. Whereas if you are just processing a DataFrame and splitting the processing into different functions, you can write the functions the way you did.
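A minimal sketch of that pipeline idea, copying once at the entry point so the individual steps are free to mutate (the step functions and column name are made up for illustration):
import pandas as pd

def add_flag(df):
    df['flag'] = df['Existing Column'] > 1
    return df

def square_existing(df):
    df['Existing Column'] = df['Existing Column'] ** 2
    return df

def run_pipeline(df):
    # copy once here, so the caller's DataFrame is never touched by the steps
    return (df.copy()
              .pipe(add_flag)
              .pipe(square_existing))

df = pd.DataFrame([0, 1, 2, 3], columns=['Existing Column'])
out = run_pipeline(df)
print(df)    # the input is unchanged
print(out)   # the processed copy has 'flag' and the squared column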
I have a function that aims at printing the sum along a column of a pandas DataFrame, after filtering on some rows to be defined, and the percentage this quantity makes up in the same sum without any filter:
import numpy as np

def my_function(df, filter_to_apply, col):
    my_sum = np.sum(df[filter_to_apply][col])
    print(my_sum)
    print(my_sum / np.sum(df[col]))
Now I am wondering if there is any way to have a filter_to_apply that actually doesn't do any filter (i.e. keeps all rows), to keep using my function (that is actually a bit more complex and convenient) even when I don't want any filter.
So, some filter_f1 that would give df[filter_f1] = df (i.e. return the whole DataFrame) and could be combined with other filters: filter_f1 & filter_f2.
One possible answer is: df.index.isin(df.index) but I am wondering if there is anything easier to understand (e.g. I tried to use just True but it didn't work).
A Python slice object, i.e. slice(None), acts as an object that selects all indexes in an indexable object, just like the bare colon does. So df[slice(None)] selects all rows in the DataFrame (it is equivalent to df[:]). You can store that in a variable as an initial value which you can further refine in your logic:
filter_to_apply = slice(None)  # initialize to select all rows
...  # logic that may set `filter_to_apply` to something more restrictive
my_function(df, filter_to_apply, col)
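A quick check of the no-op behaviour on a tiny made-up frame, next to a real boolean filter:
import pandas as pd
import numpy as np

df = pd.DataFrame({'col': [1, 2, 3], 'price': [100, 600, 700]})

no_filter = slice(None)              # keeps every row
print(df[no_filter].equals(df))      # True

price_filter = df['price'] > 500     # an ordinary boolean mask
print(np.sum(df[price_filter]['col']))   # 5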
This is a way to select all rows:
df[0:len(df)]
and so is this:
df[:]
But I haven't figured out a way to pass : as an argument.
There's an indexer called loc in pandas that filters rows. You could do something like this:
df2 = df.loc[<Filter here>]
# The filter can be something like df['price'] > 500 or df['name'] == 'Brian',
# basically something that returns a boolean for each row
total = df2['ColumnToSum'].sum()
I am trying to select some rows from a pandas DataFrame and store the subset/selection in a variable so I can perform multiple operations on this subset (including modification) without having to do the selection again. But I don't quite understand why it doesn't work.
For example, this doesn't work as expected (the original df doesn't get modified):
df = pd.DataFrame({"a":list(range(1,3))})
subDf = df.loc[df.a==2,:]
subDf.loc[:,"a"] = -1 # also throws SettingWithCopyWarning
# ... do more stuff with subDf...
But, this works as expected:
df = pd.DataFrame({"a":list(range(1,3))})
mask = (df.a==2)
df.loc[mask,"a"] = -1
After reading the pandas docs on indexing view vs copy, I was under the impression that selecting via .loc will return a view, but apparently that's not the case given the SettingWithCopyWarning. What am I misunderstanding here?
In subDf = df.loc[df.a==2,:] the method you are using is actually __getitem__ (df.loc.__getitem__), which is not guaranteed to return a view. When you assign something to loc (for example df.loc[mask,"a"] = -1) you are actually calling __setitem__ (df.loc.__setitem__). Here, since it has to assign a value to that selection, it operates on the original DataFrame directly, so the original is guaranteed to be modified.
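Two ways around it, sketched on the question's example: either keep modifying the original through df.loc with a mask, or take an explicit .copy() so the subset is an independent frame you can edit without the warning (changes then stay in the copy, of course).
import pandas as pd

# Option 1: modify the original directly through .loc with a mask
df = pd.DataFrame({"a": list(range(1, 3))})
mask = df.a == 2
df.loc[mask, "a"] = -1          # the original df is changed

# Option 2: work on an explicit, independent copy
df = pd.DataFrame({"a": list(range(1, 3))})
subDf = df.loc[df.a == 2, :].copy()
subDf.loc[:, "a"] = -1          # no SettingWithCopyWarning; df is untouched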
I want to sum different columns in a Spark DataFrame.
Code
from pyspark.sql import functions as F
cols = ["A.p1","B.p1"]
df = spark.createDataFrame([[1,2],[4,89],[12,60]],schema=cols)
# 1. Works
df = df.withColumn('sum1', sum([df[col] for col in ["`A.p1`","`B.p1`"]]))

# 2. Doesn't work
df = df.withColumn('sum1', F.sum([df[col] for col in ["`A.p1`","`B.p1`"]]))

# 3. Doesn't work
df = df.withColumn('sum1', sum(df.select(["`A.p1`","`B.p1`"])))

Why don't approaches #2 and #3 work?
I am on Spark 2.2
Because,
# 1. Works
df = df.withColumn('sum1', sum([df[col] for col in ["`A.p1`","`B.p1`"]]))
Here you are using Python's built-in sum function, which takes an iterable as input, so it works. https://docs.python.org/2/library/functions.html#sum
# 2. Doesn't work
df = df.withColumn('sum1', F.sum([df[col] for col in ["`A.p1`","`B.p1`"]]))
Here you are using the PySpark sum function, which takes a column as input, but you are trying to use it at row level.
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.sum
# 3. Doesn't work
df = df.withColumn('sum1', sum(df.select(["`A.p1`","`B.p1`"])))
Here, df.select() returns a DataFrame, and you are trying to sum over a DataFrame. In this case, I think, you would have to iterate row-wise and apply the sum over it.
TL;DR builtins.sum is just fine.
Following your comments:
Using native python sum() is not benefitting from spark optimization. so whats the spark way of doing it
and
its not a pypark function so it wont be really be completely benefiting from spark right.
I can see you are making incorrect assumptions.
Let's decompose the problem:
[df[col] for col in ["`A.p1`","`B.p1`"]]
creates a list of Columns:
[Column<b'A.p1'>, Column<b'B.p1'>]
Let's call it iterable.
sum reduces the output by taking elements of this list and calling the __add__ method (+). The imperative equivalent is:
accum = iterable[0]
for element in iterable[1:]:
    accum = accum + element
This gives Column:
Column<b'(A.p1 + B.p1)'>
which is the same as calling
df["`A.p1`"] + df["`B.p1`"]
No data has been touched, and when evaluated it benefits from all Spark optimizations.
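For completeness, an end-to-end sketch of that approach on the question's data (the SparkSession setup is added here so it runs standalone; the show() output in the comments is what I would expect):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

cols = ["A.p1", "B.p1"]
df = spark.createDataFrame([[1, 2], [4, 89], [12, 60]], schema=cols)

# builtins.sum folds the Columns with +, producing a single Column expression
df = df.withColumn('sum1', sum([df[c] for c in ["`A.p1`", "`B.p1`"]]))
df.show()
# +----+----+----+
# |A.p1|B.p1|sum1|
# +----+----+----+
# |   1|   2|   3|
# |   4|  89|  93|
# |  12|  60|  72|
# +----+----+----+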
Addition of multiple columns from a list into one column
I tried a lot of methods and the following are my observations:
PySpark's sum function doesn't support column addition (PySpark version 2.3.1).
Python's built-in sum function works for some folks but gives an error for others (might be because of a conflict in names).
In your 3rd approach, the expression (inside Python's sum function) returns a PySpark DataFrame.
So, the addition of multiple columns can be achieved using the expr function in PySpark, which takes an expression to be computed as an input.
from pyspark.sql.functions import expr
cols_list = ['a', 'b', 'c']
# Creating an addition expression using `join`
expression = '+'.join(cols_list)
df = df.withColumn('sum_cols', expr(expression))
This gives us the desired sum of columns. We can also use any other complex expression to get other output.
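Applied back to the question's columns, whose names contain dots, the same idea should work with backticks inside the expression (a sketch under that assumption):
from pyspark.sql.functions import expr

cols_list = ["`A.p1`", "`B.p1`"]      # backticks because the column names contain a dot
expression = ' + '.join(cols_list)    # gives "`A.p1` + `B.p1`"
df = df.withColumn('sum_cols', expr(expression))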