apply a function to each row of the dataframe - python

What is a more elegant way of implementing below?
I want to apply a function, my_function, to a dataframe where each row contains the function's parameters. Then I want to write the function's output back to that row.
results = pd.DataFrame()
for row in input_panel.iterrows():
    (index, row_contents) = row
    row_contents['target'] = my_function(*list(row_contents))
    # pd.concat expects DataFrames, so turn the Series back into a one-row frame
    results = pd.concat([results, row_contents.to_frame().T])

We'll iterate through the values and build a DataFrame at the end.
results = pd.DataFrame([my_function(*x) for x in input_panel.values.tolist()])
The less recommended method is using DataFrame.apply (note axis=1, so each row is passed to the function):
results = input_panel.apply(lambda x: my_function(*x), axis=1)
The only advantage of apply is less typing.
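Here is a minimal, self-contained sketch of both approaches; my_function and input_panel are hypothetical stand-ins:
import pandas as pd

def my_function(a, b):
    return a + b  # hypothetical stand-in for the real, presumably costlier, function

input_panel = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})

# list-comprehension approach: one pass over the raw values
results = pd.DataFrame([my_function(*x) for x in input_panel.values.tolist()])

# apply approach: axis=1 hands each row to the lambda as a Series
input_panel['target'] = input_panel.apply(lambda x: my_function(*x), axis=1)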

Related

How to save the change for pandas dataframe after iterating by row?

I create a simple function to replace a certain column in df by row:
def replace(df):
    for index, row in df.iterrows():
        row['ALARM_TEXT'] = row['ALARM_TEXT'].replace('\'', '')
    return df
But the input df has not been changed after I call the function. Is there something wrong with it?
iterrows yields a copy of each row, so assignments to row never propagate back to df. We usually do the vectorized equivalent instead:
df['ALARM_TEXT'] = df['ALARM_TEXT'].str.replace('\'', '')
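A minimal sketch of the vectorized fix, with hypothetical alarm strings:
import pandas as pd

df = pd.DataFrame({'ALARM_TEXT': ["fault 'A'", "fault 'B'"]})
# regex=False treats the quote as a literal character, not a pattern
df['ALARM_TEXT'] = df['ALARM_TEXT'].str.replace('\'', '', regex=False)
print(df)  # the quotes are gone: fault A, fault B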

Efficient way to unnest pandas dataframe

I'm accessing a fairly large series of JSON files and storing them in a pandas Series that is part of a larger dataframe. There are several fields I want from said JSON, some of which are nested. I've been extracting them using json_normalize. The goal is to then merge these new fields with my original dataframe.
My problem is that when I do so, instead of getting a dataframe with J rows and K columns, I get a J-length Series in which each element is a 1xK dataframe. I'm wondering whether there is an efficient, vectorized way to turn this nested Series of dataframes into a regular dataframe, or to get a regular dataframe from the start.
I've used map/lambda to create my nested Series. Right now I'm unnesting with iteritems/append, but there has to be a more efficient way.
url_base = 'http://foo.bar='
df['http'] = df['id'].map(lambda x: url_base + x)
df['json'] = df['http'].map(lambda x: nf.get_json(x))
nest_ser = df['json'].map(lambda x: json_normalize(x))
df = pd.DataFrame()
for index, item in nest_ser.iteritems():
    df = df.append(item)
json_normalize produces:
pd.Series([pd.DataFrame([col1, col2, ...]), pd.DataFrame([col1, col2, ...]), pd.DataFrame([col1, col2, ...])])
instead of
pd.DataFrame([col1,col2...])
Suppose the Series produced by json_normalize is named sr; then the whole unnesting step is:
pd.concat(sr.tolist())
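A minimal sketch of that one-liner, with hypothetical column data standing in for the json_normalize output:
import pandas as pd

# sr mimics what json_normalize produces: a Series of one-row DataFrames
sr = pd.Series([pd.DataFrame({'col1': [1], 'col2': ['a']}),
                pd.DataFrame({'col1': [2], 'col2': ['b']})])

# ignore_index=True resets the repeated 0 index from each one-row frame
flat = pd.concat(sr.tolist(), ignore_index=True)
print(flat)  # a regular 2x2 DataFrame with columns col1 and col2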

Pandas in Dataframe

I am posting this and hoping I will get a convincing answer.
df is my dataframe. I want to know what is being passed to min_max by the apply function. When I print row inside min_max, I don't get the same dataframe that I see outside it.
import numpy as np
import pandas as pd

def min_max(row):
    print(row)
    print()
    data = row[['POPESTIMATE2010',
                'POPESTIMATE2011',
                'POPESTIMATE2012',
                'POPESTIMATE2013',
                'POPESTIMATE2014',
                'POPESTIMATE2015']]
    return pd.Series({'min': np.min(data), 'max': np.max(data)})

df.apply(min_max, axis=1)
df.apply simply invokes the provided function, in your case min_max, once for each object along the chosen axis. Per the documentation of apply, axis=1 means a row-wise operation and axis=0 a column-wise operation.
Thus, in your case, it will invoke min_max once for each row of the dataframe, passing that row in as a Series.
For further elaboration:
import pdb

def print_funt(row):
    pdb.set_trace()
    print(row)

df = pd.DataFrame({'Temp1': [62, 62, 50, 62, 50, 62, 62],
                   'Temp2': [66, 66, 69, 66, 69, 66, 66],
                   'Temp3': [52, 62, 52, 62, 52, 62, 52],
                   'Target': [0.24, 0.28, 0.25, 0.28, 0.25, 0.28, 0.24]})
print(df)
df.apply(print_funt, axis=1)
The output at the first iteration shows the function receiving a single row as a Series, indexed by the column names.

How do you effectively use pd.DataFrame.apply on rows with duplicate values?

The function that I'm applying is a little expensive, as such I want it to only calculate the value once for unique values.
The only solution I've been able to come up with has been as follows:
This first step is needed because apply doesn't work on arrays, so I have to convert the unique values into a Series.
new_vals = pd.Series(data['column'].unique()).apply(function)
This step is needed because .merge only works on dataframes.
new_dataframe = pd.DataFrame(index=data['column'].unique(), data=new_vals.values)
Finally, merging the results:
yet_another = pd.merge(data, new_dataframe, right_index=True, left_on='column')
data['calculated_column'] = yet_another[0]
So basically I had to convert my values to a Series, apply the function, convert the result back to a DataFrame, merge the results, and use that column to create my new column.
I'm wondering if there is some one-line solution that isn't as messy. Something pythonic that doesn't involve re-casting object types multiple times. I've tried grouping by but I just can't figure out how to do it.
My best guess would have been to do something along these lines
data[calculated_column] = dataframe.groupby(column).index.apply(function)
but that isn't right either.
This is an operation that I do often enough to want to learn a better way to do, but not often enough that I can easily find the last time I used it, so I end up re-figuring a bunch of things again and again.
If there is no good solution, I guess I could just add this function to my library of common tools that I hedonistically pull in with from me_tools import *:
def apply_unique(data, column, function):
    new_vals = pd.Series(data[column].unique()).apply(function)
    new_dataframe = pd.DataFrame(data=new_vals.values,
                                 index=data[column].unique())
    result = pd.merge(data, new_dataframe, right_index=True, left_on=column)
    return result[0]
I would do something like this:
def apply_unique(df, orig_col, new_col, func):
    return df.merge(df[[orig_col]]
                      .drop_duplicates()
                      .assign(**{new_col: lambda x: x[orig_col].apply(func)}),
                    how='inner', on=orig_col)
This will return the same DataFrame as performing:
df[new_col] = df[orig_col].apply(func)
but will be much more performant when there are many duplicates.
How it works:
We merge the original DataFrame (the caller) with a second DataFrame (the argument) that contains two columns: the original column and the new column derived from it.
The new column in the passed DataFrame is assigned using .assign and a lambda function, making it possible to apply the function to the DataFrame that has already had .drop_duplicates() performed on it.
A dict is used here for convenience only, as it allows a column name to be passed in as a str.
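A minimal sketch of the .assign(**{...}) idiom in isolation, with a hypothetical column name held in a variable:
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]})
new_col = 'x_squared'  # the column name only becomes known at runtime
df = df.assign(**{new_col: lambda d: d['x'] ** 2})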
Edit:
As an aside: it is best to drop new_col if it already exists, otherwise the merge will append suffixes to the duplicate new_col columns:
if new_col in df:
    df = df.drop(new_col, axis='columns')
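A hedged usage sketch of apply_unique; slow_func and the data are hypothetical stand-ins for the expensive function and the real column:
import pandas as pd

def slow_func(v):
    return v * 10  # imagine an expensive computation here

df = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b']})
df = apply_unique(df, orig_col='key', new_col='key_x10', func=slow_func)
# slow_func ran only twice (once per unique value), not five times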

What is the `pandas` way to create a column in a dataframe by operating on each row?

I have an apply function that operates on each row in my dataframe. The result of that apply function is a new value. This new value is intended to go in a new column for that row.
So, after applying this function to all of the rows in the dataframe, there will be an entirely new column in that dataframe.
How do I do this in pandas?
Two ways primarily:
df['new_column'] = df.apply(my_fxn, axis=1)
or
df = df.assign(new_column=df.apply(my_fxn, axis=1))
If you need to use other arguments, you can pass them to the apply function, but sometimes it's easier (for me) to just use a lambda:
df['new_column'] = df.apply(lambda row: my_fxn(row, global_dict), axis=1)
Additionally, if your function can operate on arrays in a vectorized fashion, you could just do:
df['new_column'] = my_fxn(df['col1'], df['col2'])
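A minimal runnable sketch of both styles, with a hypothetical my_fxn:
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

def my_fxn(row):
    return row['col1'] + row['col2']

df['new_column'] = df.apply(my_fxn, axis=1)

# vectorized alternative, when the computation supports whole columns:
df['new_column'] = df['col1'] + df['col2']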
