Python Pandas Group Recursion

The question I have is closely related to this post.
Assume I have the following dataset:
df = pd.DataFrame({"A":range(1,10), "B":range(5,14), "Group":
[1,1,2,2,2,2,3,3,3],"C":[0,0,10,0,0,16,0,0,22], "last":[0,1,0,0,0,1,0,0,1],
"Want": [19.25,8,91.6,71.05,45.85,16,104.95,65.8,22]})
The last observation for each group is straightforward. This is what the code looks like:
def calculate(df):
    if df["last"] == 1:
        value = df.loc["A"] + df.loc["B"]
    else:
        # for all other observations PER GROUP, the row value is calculated as follows
        # (pseudocode; i refers to the current row within the group):
        value = df.loc[i-1, "C"] + 3 * df.loc[i, "A"] + 1.65 * df.loc[i, "B"]
    return value
(Note: df["last"] is used rather than df.last, since .last is a built-in pandas method and would shadow the column.)
To further clarify, these are the Excel formulas for calculating the Want column for Group 2: F4="F5+(3*A4)+(1.65*B4)", F5="F6+(3*A5)+(1.65*B5)", F6="F7+(3*A6)+(1.65*B6)", F7="A7+B7". There's some kind of "recursive" nature to it, which is why I thought of the for loop.
I would really appreciate a solution that is consistent with the first if branch, that is,
value = something
rather than having the function return a data frame or something like that, so that I can call the function as follows:
df["value"] = df.apply(calculate, axis=1)
Your help is appreciated. Thanks

You don't need apply here. apply is usually very slow, and you'll want to avoid it.
Problems with this kind of recursive structure are usually hard to vectorize. Thankfully, yours can be solved using a reversed cumsum and np.where:
df['Want'] = np.where(df['last'] == 1, df['A'] + df['B'], 3*df['A'] + 1.65*df['B'])
df['Want'] = df[::-1].groupby('Group')['Want'].cumsum()
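For reference, here is the whole thing as a runnable snippet (a minimal sketch using the question's sample data); the reversed slice df[::-1] makes cumsum accumulate from the bottom of each group upward, and the assignment realigns by index:

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": range(1, 10), "B": range(5, 14),
                   "Group": [1, 1, 2, 2, 2, 2, 3, 3, 3],
                   "C": [0, 0, 10, 0, 0, 16, 0, 0, 22],
                   "last": [0, 1, 0, 0, 0, 1, 0, 0, 1],
                   "Want": [19.25, 8, 91.6, 71.05, 45.85, 16, 104.95, 65.8, 22]})

# seed each row with its own contribution, then accumulate bottom-up per group
df['value'] = np.where(df['last'] == 1, df['A'] + df['B'], 3*df['A'] + 1.65*df['B'])
df['value'] = df[::-1].groupby('Group')['value'].cumsum()

assert np.allclose(df['value'], df['Want'])  # reproduces the expected column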

Related

pandas: groupby using specific class without filtering the rows

I'm trying to calculate farm growth probability with groupby. However, is it possible to restrict the calculation to a specific class (state: WA) without filtering out the other rows?
I was able to calculate the probability with the code below, but it filters out the rows that are not WA, which makes it impossible to do further calculations that require those filtered-out states in the same dataframe.
df['all'] = (df[df["state"]=="WA"].groupby(['farm'])['place'].cumcount() + 1)
df['1'] = df['add'].loc[df.place == 1]
df = df[df["state"]=="WA"].groupby(['farm']).apply(lambda x: x.fillna(method='ffill').fillna(0))
df['farm_growth%'] = (df['1'] / df['all'])
df sample:
data = {'farm':['aaa','ggc','cdv','ddf','wes','jcj','jjw'],
        'place':[1,4,5,8,7,10,1]}
df = pd.DataFrame(data)
Regards, thank you.
Update to the question:
My goal is to have the calculations for both "all states" and "state: WA" in the same dataframe.
#delimiter has provided the code for "all states". Thanks.
Not sure if I understand your question well, but if you want to keep all the states in the calculation, just remove the following boolean filter in square brackets from the expression:
df[df["state"]=="WA"]
would become simply
df
or for the full sample:
df['all'] = (df.groupby(['farm'])['place'].cumcount() + 1)
df['1'] = df['add'].loc[df.place == 1]
df = df.groupby(['farm']).apply(lambda x: x.fillna(method='ffill').fillna(0))
df['farm_growth%'] = (df['1'] / df['all'])
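As a quick illustration of the difference (a hedged sketch on hypothetical data, since the posted sample omits the state column), filtering before the groupby leaves NaN in the non-WA rows on assignment, while the unfiltered version counts every row:

import pandas as pd

# hypothetical data; the question's sample does not include a 'state' column
df = pd.DataFrame({'state': ['WA', 'WA', 'OR', 'WA'],
                   'farm':  ['aaa', 'aaa', 'ggc', 'ggc'],
                   'place': [1, 4, 5, 1]})

# filtered: only WA rows get a count, the OR row becomes NaN on assignment
df['wa_count'] = df[df['state'] == 'WA'].groupby('farm')['place'].cumcount() + 1

# unfiltered: every row participates
df['all_count'] = df.groupby('farm')['place'].cumcount() + 1
print(df)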

Python : Adding conditional column to pandas dataframe, more pythonic solution?

I'm adding a column to a dataframe where the column values are determined by comparing two other columns in the dataframe. The code to add the column is:
lst = []
for x in range(len(df)):
    if df['ColumnA'][x] > df['ColumnB'][x]:
        lst.append(df['ColumnB'][x])
    else:
        lst.append(df['ColumnA'][x])
df['ColumnC'] = lst
My question is: is there a more efficient/Pythonic way to do this? I have been advised in the past to be wary if I'm ever looping through every row of a dataframe, so I wanted to see if I was missing something. Thank you!
Yes, just take the row-wise minimum:
df['ColumnC'] = df[['ColumnA', 'ColumnB']].min(axis=1)
Use numpy.where
df['ColumnC'] = np.where(df['ColumnA'] > df['ColumnB'], df['ColumnB'], df['ColumnA'])
A bit more code than the other solutions, but arguably more generalizable:
mask = df['ColumnA'] > df['ColumnB']
df.loc[mask, 'ColumnC'] = df.loc[mask, 'ColumnB']
df.loc[~mask, 'ColumnC'] = df.loc[~mask, 'ColumnA']
(Note the single .loc[rows, column] indexing: chained assignments like df['ColumnC'].loc[mask] = ... write to a temporary object and are unreliable. The branches also keep the smaller value, matching the original loop.)
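A quick sanity check on a toy frame (hypothetical data) shows the approaches agree:

import numpy as np
import pandas as pd

df = pd.DataFrame({'ColumnA': [1, 5, 3], 'ColumnB': [4, 2, 3]})

via_min = df[['ColumnA', 'ColumnB']].min(axis=1)
via_where = np.where(df['ColumnA'] > df['ColumnB'], df['ColumnB'], df['ColumnA'])

assert (via_min == via_where).all()
print(via_min.tolist())  # [1, 2, 3]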

Scoping in for loop with if statement - pandas.append does not work in loop

This piece of code prints 10, which is what I would expect:
for i in range(5):
    if i == 0:
        output = i
    else:
        output += i
print(output)
Why does this code only print the dataframe created in the if branch of the statement (i.e. when i == 0)?
for i in range(5):
    if i == 0:
        output = pd.DataFrame(np.random.randn(5, 2))
    else:
        output.append(pd.DataFrame(np.random.randn(5, 2)))
print('final', output)
The above is an MVCE of an issue I am having with the code below. More context, if interested:
for index, row in per_dmd_df.iterrows():
    if index == 0:
        output = pd.DataFrame(dmd_flow(row.balance, dt.date(2018,1,31), 12, .05, 0, .03, 'monthly'))
    else:
        output.append(pd.DataFrame(dmd_flow(row.balance, dt.date(2018,1,31), 12, .05, 0, .03, 'monthly')))
print(output)
I have an input DataFrame with one row per product (balances, rates, etc.). I want the data in each row to be fed to the dmd_flow function (it returns a generator that, when wrapped in pd.DataFrame(), produces a 12-month forward-looking balance forecast) so I can forecast changes in the balance of each product based on the parameters passed to dmd_flow. I would then add up all of the changes to get the net change in balance (done using a groupby on the date and summing balances).
Each call like this creates the new DataFrame I need:
pd.DataFrame(dmd_flow(row.balance, dt.date(2018,1,31), 12, .05, 0, .03, 'monthly'))
but the append doesn't work to expand the output DataFrame.
Because (unlike list.append) DataFrame.append is not an in-place operation. See the docs for more information. You're supposed to assign the result back:
df = df.append(...)
Although, in this case, I'd advise using something like apply if you are unable to vectorize your function:
df['balance'].apply(
    dmd_flow, args=(dt.date(2018, 1, 31), 12, .05, 0, .03, 'monthly')
)
This hides the loop, so you don't need to worry about the index. Make sure your function is written in such a way as to support scalar arguments.
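If you do need to build the frame in a loop, the idiomatic pattern is to collect the pieces in a list and concatenate once at the end (DataFrame.append was deprecated and later removed in pandas 2.0); a minimal sketch:

import numpy as np
import pandas as pd

# collect the per-iteration frames, then concatenate in one call
frames = [pd.DataFrame(np.random.randn(5, 2)) for _ in range(5)]
output = pd.concat(frames, ignore_index=True)
print(output.shape)  # (25, 2)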

Python cumsum increment every time new value is encountered

Coming from R, the code would be
x <- data.frame(vals = c(100,100,100,100,100,100,200,200,200,200,200,200,200,300,300,300,300,300))
x$state <- cumsum(c(1, diff(x$vals) != 0))
Which marks every time the difference between rows is non-zero, so that I can use it to spot transitions in data, like so:
vals state
1 100 1
...
7 200 2
...
14 300 3
What would be a clean equivalent in Python?
Additional question
The answer to the original question is posted below, but it won't work properly for a grouped dataframe in pandas.
Data here: https://pastebin.com/gEmPHAb7. Notice that there are 2 different filenames.
When it is imported as df_all, I group it with the following and then apply the solution posted below.
df_grouped = df_all.groupby("filename")
df_all["state"] = (df_grouped['Fit'].diff() != 0).cumsum()
Using diff and cumsum, as in your R example:
df['state'] = (df['vals'].diff()!= 0).cumsum()
This uses the fact that True has integer value 1
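As a runnable check, here is a small sketch reproducing the R example's data; the first diff is NaN, which compares unequal to 0, so the counter correctly starts at 1:

import pandas as pd

df = pd.DataFrame({'vals': [100]*6 + [200]*7 + [300]*5})
df['state'] = (df['vals'].diff() != 0).cumsum()
print(df['state'].unique())  # [1 2 3]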
Bonus question
df_grouped = df_all.groupby("filename")
df_all["state"] = (df_grouped['Fit'].diff() != 0).cumsum()
I think you misunderstand what groupby does. All groupby does is create groups based on the criterion (filename in this instance). You then need to add another operation that tells pandas what should happen with each group.
Common operations are mean and sum, or more advanced ones such as apply and transform.
You can find more information here or here
If you can explain in more detail what you want to achieve with the groupby, I can help you find the correct method. If you want to perform the above operation per filename, you probably need something like this:
def get_state(group):
    return (group.diff() != 0).cumsum()

df_all['state'] = df_all.groupby('filename')['Fit'].transform(get_state)
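For instance, on a small hypothetical frame with two filenames, the state counter restarts inside each group:

import pandas as pd

df_all = pd.DataFrame({'filename': ['f1']*4 + ['f2']*4,
                       'Fit':      [1, 1, 2, 2, 5, 5, 5, 6]})

def get_state(group):
    return (group.diff() != 0).cumsum()

df_all['state'] = df_all.groupby('filename')['Fit'].transform(get_state)
print(df_all['state'].tolist())  # [1, 1, 2, 2, 1, 1, 1, 2]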

How can I speed up an iterative function on my large pandas dataframe?

I am quite new to pandas and I have a dataframe of about 500,000 rows filled with numbers. I am using Python 2.x and am currently defining and calling the method shown below on it. It sets a predicted value equal to the corresponding value in series 'B' if two adjacent values in series 'A' are the same. However, it is running extremely slowly (about 5 rows are output per second), and I want to find a way to accomplish the same result more quickly.
def myModel(df):
    A_series = df['A']
    B_series = df['B']
    seriesLength = A_series.size
    # Make a new empty column in the dataframe to hold the predicted values
    df['predicted_series'] = np.nan
    # Make a new empty column to store whether or not
    # the prediction matches B
    df['wrong_prediction'] = np.nan
    prev_B = B_series[0]
    for x in range(1, seriesLength):
        prev_A = A_series[x-1]
        prev_B = B_series[x-1]
        # set the predicted value to equal B if A has two equal values in a row
        if A_series[x] == prev_A:
            if df['predicted_series'][x] > 0:
                df['predicted_series'][x] = df['predicted_series'][x-1]
            else:
                df['predicted_series'][x] = B_series[x-1]
Is there a way to vectorize this, or just to make it run faster? Under the current circumstances, it is projected to take many hours. Should it really be taking this long? It doesn't seem like 500,000 rows should give my program that much trouble.
Something like this should work as you described:
df['predicted_series'] = np.where(A_series.shift() == A_series, B_series, df['predicted_series'])
df.loc[df.A.diff() == 0, 'predicted_series'] = df.B
This will get rid of the for loop and set predicted_series to the value of B when A is equal to previous A.
edit:
per your comment, change your initialization of predicted_series to be all NaN and then forward-fill the values:
df['predicted_series'] = np.nan
df.loc[df.A.diff() == 0, 'predicted_series'] = df.B
df.predicted_series = df.predicted_series.fillna(method='ffill')
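Putting that edit together, a minimal end-to-end sketch on hypothetical data (the first value stays NaN because row 0 has no previous A to compare against):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2, 2, 3],
                   'B': [10, 20, 30, 40, 50, 60]})

df['predicted_series'] = np.nan
df.loc[df.A.diff() == 0, 'predicted_series'] = df.B
df['predicted_series'] = df['predicted_series'].ffill()
print(df['predicted_series'].tolist())  # [nan, 20.0, 20.0, 40.0, 50.0, 50.0]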
For the fastest speed, modifying ayhan's answer a bit will perform best:
df['predicted_series'] = np.where(df.A.shift() == df.A, df.B, df['predicted_series'].shift())
That will give you your forward-filled values and will run faster than my original recommendation.
Solution
df.loc[df.A == df.A.shift(), 'predicted_series'] = df.B.shift()
(The column label is needed inside the .loc; without it, the assignment would overwrite every column of the matched rows.)
