Apply function over relative rows in Pandas - python

It seems that applying functions to data frames is typically wrt series (e.g df.apply(my_fun)) and so such functions index 'one row at a time'. My question is if one can get more flexibility in the following sense: for a data frame df, write a function my_fun(row) such that we can point to rows above or below the row.
For example, start with the following:
def row_conditional(df, groupcol, appcol1, appcol2, newcol, sortcol, shift):
"""Input: df (dataframe): input data frame
groupcol, appcol1, appcol2, sortcol (str): column names in df
shift (int): integer to point to a row above or below current row
Output: df with a newcol appended based on conditions
"""
df[newcol] = '' # fill new col with blank str
list_results = []
members = set(df[groupcol])
for m in members:
df_m = df[df[groupcol]==m].sort(sortcol, ascending=True)
df_m = df_m.reset_index(drop=True)
numrows_m = df_m.shape[0]
for r in xrange(numrows_m):
# CONDITIONS, based on rows above or below
if (df_m.loc[r + shift, appcol1]>0) and (df_m.loc[r - shfit, appcol2]=='False'):
df_m.loc[r, newcol] = 'old'
else:
df_m.loc[r, newcol] = 'new'
list_results.append(df_m)
return pd.concat(list_results).reset_index(drop=True)
Then, I'd like to be able to re-write the above as:
def new_row_conditional(row, shift):
"""apply above conditions to row relative to row[shift, appcol1] and row[shift, appcol2]
"""
return new value at df.loc[row, newcol]
and finally execute:
df.apply(new_row_conditional)
Thoughts/Solutions with 'map' or 'transform' are also very welcome.
From an OO-approach, I might imagine a row of df to be treated as an object that has attributes i) a pointer to all rows above it and ii) a pointer to all rows below it. Then referencing row.above and row.below in order to assign the new value at df.loc[row, newcol]

One can always look into the enclosing execution frame:
import pandas
dataf = pandas.DataFrame({'a':(1,2,3), 'b':(4,5,6)})
import sys
def foo(roworcol):
# current index along the axis
axis_i = sys._getframe(1).f_locals['i']
# data frame the function is applied to
dataf = sys._getframe(1).f_locals['self']
axis = sys._getframe(1).f_locals['axis']
# number of elements along the chosen axis
n = dataf.shape[(1,0)[axis]]
# print where we are
print('index: %i - %i items before, %i items after' % (axis_i,
axis_i,
n-axis_i-1))
Within the function function foo there is:
roworcol the current element out of the iteration
axis the chosen axis
axis_i the index of along the chosen axis
dataf the data frame
This is all needed to point before and after in the data frame.
>>> dataf.apply(foo, axis=1)
index: 0 - 0 items before, 2 items after
index: 1 - 1 items before, 1 items after
index: 2 - 2 items before, 0 items after
A complete implementation of the specific example you added in the comments would then be:
import pandas
import sys
df = pandas.DataFrame({'a':(1,2,3,4), 'b':(5,6,7,8)})
def bar(row, k):
axis_i = sys._getframe(2).f_locals['i']
# data frame the function is applied to
dataf = sys._getframe(2).f_locals['self']
axis = sys._getframe(2).f_locals['axis']
# number of elements along the chosen axis
n = dataf.shape[(1,0)[axis]]
if axis_i == 0 or axis_i == (n-1):
res = 0
else:
res = dataf['a'][axis_i - k] + dataf['b'][axis_i + k]
return res
You'll note that whenever additional arguments are present in the signature of the function mapped we need jump 2 frames up.
>>> df.apply(bar, args=(1,), axis=1)
0 0
1 8
2 10
3 0
dtype: int64
You'll also note that the specific example you have provided can be solved by other, and possibly simpler, means. The solution above is very general in the sense that it is letting you use map while jailbreaking from the row being mapped but it may also violate assumption about what map is doing and, for example. deprive you from the opportunity to parallelize easily by assuming independent computation on the rows.

Create duplicate data frames that are index shifted, and loop over their rows in parallel.
df_pre = df.copy()
df_pre.index -= 1
result = [fun(x1, x2) for x1, x2 in zip(df_pre.iterrows(), df.iterrows()]
This assumes you actually want everything from that row. You can of course do direct operations for example
result = df_pre['col'] - df['col']
Also, there are some standard processing functions built in like diff, shift, cumsum, cumprod that do operate on adjacent rows, although the scope is limited.

Related

'Oversampling' cartesian data in a dataframe without for loop?

I have a 3D data in a pandas dataframe that I would like to 'oversample'/smooth by replacing the value at each x,y point with the average value of all the points that are within 5 units of that point. I can do it using a for loop like this (starting with a dataframe with three columns X,Y,Z):
import pandas as pd
Z_OS = []
X_OS = []
Y_OS = []
for inddex, row in df.iterrows():
Z_OS += [df[(df['X'] > row['X']-5) & (df['X']<row['X']+5) & (df['Y'] > row['Y']-5) & (df1['Y']<row['Y']+5)]['Z'].mean()]
X_OS += [row['X']]
Y_OS += [row['Y']]
dict = {
'X': X_OS,
'Y': Y_OS,
'Z': Z_OS
}
OSdf = pd.DataFrame.from_dict(dict)
but this method is very slow for large datasets and feels very 'unpythonic'. How could I do this without for loops? Is it possible via complex use of the groupby function?
xy = df[['x','y']]
df['smoothed z'] = df[['z']].apply(
lambda row: df['z'][(xy - xy.loc[row.name]).abs().lt(5).all(1)].mean(),
axis=1
)
Here I used df[['z']] to get a column 'z' as a data frame. We need an index of a row, i.e. row.name, when we apply a function to this column.
.abs().lt(5).all(1) read as absolut values which are all less then 5 along the row.
Update
The code below is actually the same but seems more consistent as it addresses directly the index:
df.index.to_series().apply(lambda i: df.loc[(xy - xy.loc[i]).abs().lt(5).all(1), 'z'].mean())
df['column_name'].rolling(rolling_window).mean()
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html

Loop through Pandas dataframe and copying to a new dataframe based on a condition

I have a dataframe df with 6000+ rows of data with a datetime index in the form YYYY-MM-DD and with columns ID, water_level and change.
I want to:
Loop through each value in the column change and identify turning points
When I find a turning point, copy that entire row of data including the index into a new dataframe e.g. turningpoints_df
For each new turning point identified in the loop, add that row of data to my new dataframe turningpoints_df so that I end up with something like this:
ID water_level change
date
2000-10-01 2 5.5 -0.01
2000-12-13 40 10.0 0.02
2001-02-10 150 1.1 -0.005
2001-07-29 201 12.4 0.01
... ... ... ...
I was thinking of taking a positional approach so something like (purely illustrative):
turningpoints_df = pd.DataFrame(columns = ['ID', 'water_level', 'change'])
for i in range(len(df['change'])):
if [i-1] < 0 and [i+1] > 0:
#this is a min point and take this row and copy to turningpoints_df
elif [i-1] > 0 and [i+1] < 0:
#this is a max point and take this row and copy to turningpoints_df
else:
pass
My issue is, is that I'm not sure how to examine each value in my change column against the value before and after and then how to pull out that row of data into a new df when the conditions are met.
it sounds like you want to make use of the shift method of the DataFrame.
# shift values down by 1:
df[change_down] = df[change].shift(1)
# shift values up by 1:
df[change_up] = df[change].shift(-1)
you should then be able to compare the values of each row and proceed with whatever you're trying to achieve..
for row in df.iterrows():
*check conditions here*
Using some NumPy features that allows you to roll() a series forwards or backwards. Then have prev and next on same row so can then use a simple function to apply() your logic as everything is on same row.
from decimal import *
import numpy as np
d = list(pd.date_range(dt.datetime(2000,1,1), dt.datetime(2010,12,31)))
df = pd.DataFrame({"date":d, "ID":[random.randint(1,200) for x in d],
"water_level":[round(Decimal(random.uniform(1,13)),2) for x in d],
"change":[round(Decimal(random.uniform(-0.05, 0.05)),3) for x in d]})
# have ref to prev and next, just apply logic
def turningpoint(r):
r["turningpoint"] = (r["prev_change"] < 0 and r["next_change"] > 0) or \
(r["prev_change"] > 0 and r["next_change"] < 0)
return r
# use numpy to shift "change" so have prev and next on same row as new columns
# initially default turningpoint boolean
df = df.assign(prev_change=np.roll(df["change"],1),
next_change=np.roll(df["change"],-1),
turningpoint=False).apply(turningpoint, axis=1).drop(["prev_change", "next_change"], axis=1)
# first and last rows cannot be turning points
df.loc[0:0,"turningpoint"] = False
df.loc[df.index[-1], "turningpoint"] = False
# take a copy of all rows that are turningpoints into new df with index
df_turningpoint = df[df["turningpoint"]].copy()
df_turningpoint

How to fill pandas dataframes in a loop?

I am trying to build a subset of dataframes from a larger dataframe by searching for a string in the column headings.
df=pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)
wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']
for well in wells:
wellname = well
well = pd.DataFrame()
well_cols = [col for col in cdf.columns if wellname in col]
well = cdf[well_cols]
I am trying to search for the wellname in the cdf dataframe columns and put those columns which contain that wellname into a new dataframe named the wellname.
I am able to build my new sub dataframes but the dataframes come up empty of size (0, 0) while cdf is (21973, 91).
well_cols also populates correctly as a list.
These are some of cdf column headings. Each column has 20k rows of data.
Index(['N1_Inj_Casing_Gas_Valve', 'N1_LT_Stm_Rate', 'N1_ST_Stm_Rate',
'N1_Inj_Casing_Gas_Flow_Rate', 'N1_LT_Stm_Valve', 'N1_ST_Stm_Valve',
'N1_LT_Stm_Pressure', 'N1_ST_Stm_Pressure', 'N1_Bubble_Tube_Pressure',
'N1_Inj_Casing_Gas_Pressure', 'N2_Inj_Casing_Gas_Valve',
'N2_LT_Stm_Rate', 'N2_ST_Stm_Rate', 'N2_Inj_Casing_Gas_Flow_Rate',
'N2_LT_Stm_Valve', 'N2_ST_Stm_Valve', 'N2_LT_Stm_Pressure',
'N2_ST_Stm_Pressure', 'N2_Bubble_Tube_Pressure',
'N2_Inj_Casing_Gas_Pressure', 'N3_Inj_Casing_Gas_Valve',
'N3_LT_Stm_Rate', 'N3_ST_Stm_Rate', 'N3_Inj_Casing_Gas_Flow_Rate',
'N3_LT_Stm_Valve', 'N3_ST_Stm_Valve', 'N3_LT_Stm_Pressure',
I want to create a new dataframe with every heading that contains the "well" IE a new dataframe for all columns & data with column name containing N1, another for N2 etc.
The New dataframes populate correctly when inside the loop but disappear when the loop breaks... a bit of the code output for print(well):
[27884 rows x 10 columns]
N9_Inj_Casing_Gas_Valve ... N9_Inj_Casing_Gas_Pressure
0 74.375000 ... 2485.602364
1 74.520833 ... 2485.346000
2 74.437500 ... 2485.341091
IIUC this should be enough:
df=pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)
wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']
well_dict={}
for well in wells:
well_cols = [col for col in cdf.columns if well in col]
well_dict[well] = cdf[well_cols]
Dictionaries are usually the way to go if you want to populate something. In this case, then, if you input well_dict['N1'], you'll get your first dataframe, and so on.
The elements of an array are not mutable when iterating over it. That is, here's what it's doing based on your example:
# 1st iteration
well = 'N1' # assigned by the for loop directive
...
well = <empty DataFrame> # assigned by `well = pd.DataFrame()`
...
well = <DataFrame, subset of cdf where col has 'N1' in name> # assigned by `well = cdf[well_cols]`
# 2nd iteration
well = 'N2' # assigned by the for loop directive
...
well = <empty DataFrame> # assigned by `well = pd.DataFrame()`
...
well = <DataFrame, subset of cdf where col has 'N2' in name> # assigned by `well = cdf[well_cols]`
...
But at no point did you change the array, or store the new dataframes for that matter (although you would still have the last dataframe stored in well at the end of the iteration).
IMO, it seems like storing the dataframes in a dict would be easier to use:
df=pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)
wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']
well_dfs = {}
for well in wells:
well_cols = [col for col in cdf.columns if well in col]
well_dfs[well] = cdf[well_cols]
However, if you really want it in a list, you could do something like:
df=pd.read_csv('data.csv')
cdf = df.drop(['DateTime'], axis=1)
wells = ['N1','N2','N3','N4','N5','N6','N7','N8','N9']
for ix, well in enumerate(wells):
well_cols = [col for col in cdf.columns if well in col]
wells[ix] = cdf[well_cols]
One way to approach the problem is to use pd.MultiIndex and Groupby.
You can add the construct a MultiIndex composed of well identifier and variable name. If you have df:
N1_a N1_b N2_a N2_b
1 2 2 3 4
2 7 8 9 10
You can use df.columns.str.split('_', expand=True) to parse the well identifer corresponding variable name (i.e. a or b).
df = pd.DataFrame(df.values, columns=df.columns.str.split('_', expand=True)).sort_index(1)
Which returns:
N1 N2
a b a b
0 2 2 3 4
1 7 8 9 10
Then you can transpose the data frame and groupby the MultiIndex level 0.
grouped = df.T.groupby(level=0)
To return a list of untransposed sub-data frames you can use:
wells = [group.T for _, group in grouped]
where wells[0] is:
N1
a b
0 2 2
1 7 8
and wells[1] is:
N2
a b
0 3 4
1 9 10
The last step is rather unnecessary because the data can be accessed from the grouped object grouped.
All together:
import pandas as pd
from io import StringIO
data = """
N1_a,N1_b,N2_a,N2_b
1,2,2,3,4
2,7,8,9,10
"""
df = pd.read_csv(StringIO(data))
# Parse Column names to add well name to multiindex level
df = pd.DataFrame(df.values, columns=df.columns.str.split('_', expand=True)).sort_index(1)
# Group by well name
grouped = df.T.groupby(level=0)
#bulist list of sub dataframes
wells = [group.T for _, group in grouped]
Using contains
df[df.columns.str.contains('|'.join(wells))]

Filling each row of one column of a DataFrame with different values (a random distribution)

I have a DataFrame with aprox. 4 columns and 200 rows. I created a 5th column with null values:
df['minutes'] = np.nan
Then, I want to fill each row of this new column with random inverse log normal values. The code to generate 1 inverse log normal:
note: if the code bellow is ran multiple times it will generate a new result because of the value inside ppf() : random.random()
df['minutes'] = df['minutes'].fillna(stats.lognorm(0.5, scale=np.exp(1.8)).ppf(random.random()).astype(int))
What's happening when I do that is that it's filling all 200 rows of df['minutes'] with the same number, instead of triggering the random.random() for each row as I expected it to.
What do I have to do? I tried using for loopbut apparently I'm not getting it right (giving the same results):
for i in range(1,len(df)):
df['minutes'] = df['minutes'].fillna(stats.lognorm(0.5, scale=np.exp(1.8)).ppf(random.random()).astype(int))
what am I doing wrong?
Also, I'll add that later I'll need to change some parameters of the inverse log normal above if the value of another column is 0 or 1. as in:
if df['type'] == 0:
df['minutes'] = df['minutes'].fillna(stats.lognorm(0.5, scale=np.exp(1.8)).ppf(random.random()).astype(int))
elif df['type'] == 1:
df['minutes'] = df['minutes'].fillna(stats.lognorm(1.2, scale=np.exp(2.7)).ppf(random.random()).astype(int))
thanks in advance.
The problem with your use of fillna here is that this function takes a value as argument and applies it to every element along the specified axis. So your stat value is calculated once and then distributed into every row.
What you need is your function called for every element on the axis, so your argument must be the function itself and not a value. That's a job for apply, which takes a function and applies it on elements along an axis.
I'm straight jumping to your final requirements:
You could use apply just on the minutes-column (as a pandas.Series method) with a lambda-function and then assign the respective results to the type-column filtered rows of column minutes:
import numpy as np
import pandas as pd
import scipy.stats as stats
import random
# setup
df = pd.DataFrame(np.random.randint(0, 2, size=(8, 4)),
columns=list('ABC') + ['type'])
df['minutes'] = np.nan
df.loc[df.type == 0, 'minutes'] = \
df['minutes'].apply(lambda _: stats.lognorm(
0.5, scale=np.exp(1.8)).ppf(random.random()).astype(int),
convert_dtype=False))
df.loc[df.type == 1, 'minutes'] = \
df['minutes'].apply(lambda _: stats.lognorm(
1.2, scale=np.exp(2.7)).ppf(random.random()).astype(int),
convert_dtype=False))
... or you use apply as a DataFrame method with a function wrapping your logic to distinguish between values of type-column and assign the result back to the minutes-column:
def calc_minutes(row):
if row['type'] == 0:
return stats.lognorm(0.5, scale=np.exp(1.8)).ppf(random.random()).astype(int)
elif row['type'] == 1:
return stats.lognorm(1.2, scale=np.exp(2.7)).ppf(random.random()).astype(int)
df['minutes'] = df.apply(calc_minutes, axis=1)
Managed to do it with some steps with a different mindset:
Created 2 lists, each with i's own parameters
Used NumPy's append
so that for each row a different random number
lognormal_tone = []
lognormal_ttwo = []
for i in range(len(s)):
lognormal_tone.append(stats.lognorm(0.5, scale=np.exp(1.8)).ppf(random.random()).astype(int))
lognormal_ttwo.append(stats.lognorm(0.4, scale=np.exp(2.7)).ppf(random.random()).astype(int))
Then, included them in the DataFrame with another previously created list:
df = pd.DataFrame({'arrival':arrival,'minTypeOne':lognormal_tone, 'minTypeTwo':lognormal_two})

Python: Iterate over dataframes of different lengths, and compute new value with repeat values

EDIT:
I realized I set up my example incorrectly, the corrected version follows:
I have two dataframes:
df1 = pd.DataFrame({'x values': [11, 12, 13], 'time':[1,2.2,3.5})
df2 = pd.DataFrame({'x values': [11, 21, 12, 43], 'time':[1,2.1,2.6,3.1})
What I need to do is iterate over both of these dataframes, and compute a new value, which is a ratio of the x values in df1 and df2. The difficulty comes in because these dataframes are of different lengths.
If I just wanted to compute values in the two, I know that I could use something like zip, or even map. Unfortunately, I don't want to drop any values. Instead, I need to be able to compare the time column between the two frames to determine whether or not to copy over a value from a previous time to the computation of in the next time period.
So for instance, I would compute the first ratio:
df1["x values"][0]/df2["x values"][0]
Then for the second I check which update happens next, which in this case is to df2, so df1["time"] < df2["time"] and:
df1["x values"][0]/df2["x values"][1]
For the third I would see that df1["time"] > df2["time"], so the third computation would be:
df1["x values"][1]/df2["x values"][1]
The only time both values should be used to compute the ratio from the same "position" is if the times in the two dataframes are equal.
And so on. I'm very confused as to whether or not this is possible to execute using something like a lambda function, or itertools. I've made some attempts, but most have yielded errors. Any help would be appreciated.
Here is what I ended up doing. Hopefully it helps clarify what my question was. Also, if anyone can think of a more pythonic way to do this, I would appreciate the feedback.
#add a column indicating which 'type' of dataframe it is
df1['type']=pd.Series('type1',index=df1.index)
df2['type']=pd.Series('type2',index=df2.index)
#concatenate the dataframes
df = pd.concat((df1, df2),axis=0, ignore_index=True)
#sort by time
df = df.sort_values(by='time').reset_index()
#we create empty arrays in order to track records
#in a way that will let us compute ratios
x1 = []
x2 = []
#we will iterate through the dataframe line by line
for i in range(0,len(df)):
#if the row contains data from df1
if df["type"][i] == "type1":
#we append the x value for that type
x1.append(df[df["type"]=="type1"]["x values"][i])
#if the x2 array contains exactly 1 value
if len(x2) == 1:
#we add it to match the number of x1
#that we have recorded up to that point
#this is useful if one time starts before the other
for j in range(1, len(x1)-1):
x2.append(x2[0])
#if the x2 array contains more than 1 value
#add a copy of the previous x2 record to correspond
#to the new x1 record
if len(x2) > 0:
x2.append(x2[len(x2)-1])
#if the row contains data from df2
if df["type"][i] == "type2":
#we append the x value for that type
x2.append(df[df["type"]=="type2"]["x values"][i])
#if the x1 array contains exactly 1 value
if len(x1) == 1:
#we add it to match the number of x2
#that we have recorded up to that point
#this is useful if one time starts before the other
for j in range(1, len(x2)-1):
x1.append(x2[0])
#if the x1 array contains more than 1 value
#add a copy of the previous x1 record to correspond
#to the new x2 record
if len(x1) > 0:
x1.append(x1[len(x1)-1])
#combine the records
new__df = pd.DataFrame({'Type 1':x1, 'Type 2': x2})
#compute the ratio
new_df['Ratio'] = new_df['x1']/f_df['x2']
You can merge the two dataframes on time and then calculate ratios
new_df = df1.merge(df2, on = 'time', how = 'outer')
new_df['ratio'] = new_df['x values_x'] / new_df['x values_y']
You get
time x values_x x values_y ratio
0 1 11 11 1.000000
1 2 12 21 0.571429
2 2 12 12 1.000000
3 3 13 43 0.302326

Categories