Python / Pandas: Eliminate for Loop using 2 DataFrames - python

First time asking a question here, so hopefully I will make my issue clear. I am trying to understand how to better apply a list of scenarios (via for loop) to the same dataset and summarize results. *Note that once a scenario is applied and I pull the relevant statistical data from the dataframe into the summary table, I do not need to retain the information. Iterrows is painfully slow, as I have tens of thousands of scenarios I want to run. Thank you for taking the time to review.
I have two Pandas dataframes: df_analysts and df_results:
1) df_analysts contains a specific list of factors (e.g. TB, JK, SF, PWR) and scenarios of weights for those factors (e.g. 50, 50, 50, 50)
TB JK SF PWR
0 50 50 50 50
1 50 50 50 100
2 50 50 50 150
3 50 50 50 200
4 50 50 50 250
2) df_results holds results by date, group, and entrant, then a ranking by each factor; finally, it has the final finish result.
Date GR Ent TB-R JK-R SF-R PWR-R Fin W1 W2 W3 W4 SUM(W)
0 11182017 1 1 2 1 2 1 2
1 11182017 1 2 3 2 3 2 1
2 11182017 1 3 1 3 1 3 3
3 11182017 2 1 1 2 2 1 1
4 11182017 2 2 2 1 1 2 1
3) I am using iterrows to:
loop through each scenario in the df_analysts dataframe,
apply the weight scenario to each factor rank (if rank = 1, then 1.0*weight; if rank = 2, then 0.68*weight; if rank = 3, then 0.32*weight) and put those results into the W1-W4 columns,
sum the W1-W4 columns,
rank the SUM(W) column.
Result sample below for a single scenario (e.g. 50,50,50,50)
Date GR Ent TB-R JK-R SF-R PWR-R Fin W1 W2 W3 W4 SUM(W) Rank
0 11182017 1 1 2 1 2 1 1 34 50 34 50 168 1
1 11182017 1 2 3 2 3 2 3 16 34 16 34 100 3
2 11182017 1 3 1 3 1 3 2 50 16 50 16 132 2
3 11182017 2 1 2 2 2 1 1 34 34 34 50 152 2
4 11182017 2 2 1 1 1 2 1 50 50 50 34 184 1
4) Finally, for each scenario, I am creating a new dataframe for the summary results (df_summary) which logs the factor/weight scenario used (from df_analysts), compares the RANK result to the Finish by date and group, and keeps a tally of where they land. Sample below (only the 50,50,50,50 scenario is shown above, which results in a 1,1).
Factors Weights Top Top2
0 (TB,JK,SF,PWR) (50,50,50,50) 1 1
1 (TB,JK,SF,PWR) (50,50,50,100) 1 0
2 (TB,JK,SF,PWR) (50,50,50,150) 1 1
3 (TB,JK,SF,PWR) (50,50,50,200) 1 0
4 (TB,JK,SF,PWR) (50,50,50,250) 1 1

You could merge your analyst and results dataframes and then perform the calculations.
def factor_rank(x, y):
    if x == 1: return y
    elif x == 2: return y*0.68
    elif x == 3: return y*0.32
df_analysts.index.name='SCENARIO'
df_analysts.reset_index(inplace=True)
df_analysts['key'] = 1
df_results['key'] = 1
df = pd.merge(df_analysts, df_results, on='key')
df.drop(['key'],axis=1,inplace=True)
df['W1'] = df.apply(lambda r: factor_rank(r['TB-R'], r['TB']), axis=1)
df['W2'] = df.apply(lambda r: factor_rank(r['JK-R'], r['JK']), axis=1)
df['W3'] = df.apply(lambda r: factor_rank(r['SF-R'], r['SF']), axis=1)
df['W4'] = df.apply(lambda r: factor_rank(r['PWR-R'], r['PWR']), axis=1)
df['SUM(W)'] = df.W1 + df.W2 + df.W3 + df.W4
df["rank"] = df.groupby(['GR','SCENARIO'])['SUM(W)'].rank(ascending=False)
You may also want to check out this question, which deals with improving processing times on row-based calculations:
How to apply a function to multiple columns of a pandas DataFrame in parallel

Related

Pandas Conditional Rolling Count

I have a question that extends from Pandas: conditional rolling count. I would like to create a new column in a dataframe that reflects the cumulative count of rows that meet several criteria.
Using the following example and code from stackoverflow 25119524:
import pandas as pd
l1 =["1", "1", "1", "2", "2", "2", "2", "2"]
l2 =[1, 2, 2, 2, 2, 2, 2, 3]
l3 =[45, 25, 28, 70, 95, 98, 120, 80]
cowmast = pd.DataFrame(list(zip(l1, l2, l3)))
cowmast.columns =['Cow', 'Lact', 'DIM']
def rolling_count(val):
    if val == rolling_count.previous:
        rolling_count.count += 1
    else:
        rolling_count.previous = val
        rolling_count.count = 1
    return rolling_count.count
rolling_count.count = 0 #static variable
rolling_count.previous = None #static variable
cowmast['xmast'] = cowmast['Cow'].apply(rolling_count) #new column in dataframe
cowmast
The output is xmast (the number of mastitis events) for each cow:
Cow Lact DIM xmast
0 1 1 45 1
1 1 2 25 2
2 1 2 28 3
3 2 2 70 1
4 2 2 95 2
5 2 2 98 3
6 2 2 120 4
7 2 3 80 5
What I would like to do is restart the count for each cow (cow) lactation (Lact) and only increment the count when the number of days (DIM) between rows is more than 7.
To incorporate more than one condition to reset the count for each cow's lactation (Lact), I used the following code.
def count_consecutive_items_n_cols(df, col_name_list, output_col):
    cum_sum_list = [
        (df[col_name] != df[col_name].shift(1)).cumsum().tolist() for col_name in col_name_list
    ]
    df[output_col] = df.groupby(
        ["_".join(map(str, x)) for x in zip(*cum_sum_list)]
    ).cumcount() + 1
    return df
count_consecutive_items_n_cols(cowmast, ['Cow', 'Lact'], ['Lxmast'])
That produces the following output
Cow Lact DIM xmast Lxmast
0 1 1 45 1 1
1 1 2 25 2 1
2 1 2 28 3 2
3 2 2 70 1 1
4 2 2 95 2 2
5 2 2 98 3 3
6 2 2 120 4 4
7 2 3 80 5 1
I would appreciate insight as to how to add another condition in the cumulative count that takes into consideration the time between mastitis events (difference in DIM between rows for cows within the same Lact). If the difference in DIM between rows for the same cow and lactation is less than 7 then the count should not increment.
The output I am looking for is called "Adjusted" in the table below.
Cow Lact DIM xmast Lxmast Adjusted
0 1 1 45 1 1 1
1 1 2 25 2 1 1
2 1 2 28 3 2 1
3 2 2 70 1 1 1
4 2 2 95 2 2 2
5 2 2 98 3 3 2
6 2 2 120 4 4 3
7 2 3 80 5 1 1
In the example above, for cow 1 lact 2 the count is not incremented when the DIM goes from 25 to 28, as the difference between the two events is less than 7 days. The same goes for cow 2 lact 2 when it goes from 95 to 98. For the larger increments, 70 to 95 and 98 to 120, the count is increased.
Thank you for your help
John
Actually, your code to set up xmast and Lxmast can be much simplified if you use the solution with the highest upvotes in the referenced question.
Renaming your dataframe cowmast to df, you can set up xmast as follows:
df['xmast'] = df.groupby((df['Cow'] != df['Cow'].shift(1)).cumsum()).cumcount()+1
Similarly, to set up Lxmast, you can use:
df['Lxmast'] = (df.groupby([(df['Cow'] != df['Cow'].shift(1)).cumsum(),
(df['Lact'] != df['Lact'].shift()).cumsum()])
.cumcount()+1
)
Data Input
l1 =["1", "1", "1", "2", "2", "2", "2", "2"]
l2 =[1, 2, 2, 2, 2, 2, 2, 3]
l3 =[45, 25, 28, 70, 95, 98, 120, 80]
cowmast = pd.DataFrame(list(zip(l1, l2, l3)))
cowmast.columns =['Cow', 'Lact', 'DIM']
df = cowmast
Output
print(df)
Cow Lact DIM xmast Lxmast
0 1 1 45 1 1
1 1 2 25 2 1
2 1 2 28 3 2
3 2 2 70 1 1
4 2 2 95 2 2
5 2 2 98 3 3
6 2 2 120 4 4
7 2 3 80 5 1
Now, continue with the last part of your requirement, quoted below:
What I would like to do is restart the count for each cow (cow)
lactation (Lact) and only increment the count when the number of days
(DIM) between rows is more than 7.
we can do it as follows.
To make the code more readable, let's define 2 grouping sequences for the code we have so far:
m_Cow = (df['Cow'] != df['Cow'].shift()).cumsum()
m_Lact = (df['Lact'] != df['Lact'].shift()).cumsum()
Then, we can rewrite the code to set up Lxmast in a more readable format, as follows:
df['Lxmast'] = df.groupby([m_Cow, m_Lact]).cumcount()+1
Now, turn to the main work here. Let's say we create another new column, Adjusted, for it:
df['Adjusted'] = (df.groupby([m_Cow, m_Lact])
['DIM'].diff().abs().gt(7)
.groupby([m_Cow, m_Lact])
.cumsum()+1
)
Result:
print(df)
Cow Lact DIM xmast Lxmast Adjusted
0 1 1 45 1 1 1
1 1 2 25 2 1 1
2 1 2 28 3 2 1
3 2 2 70 1 1 1
4 2 2 95 2 2 2
5 2 2 98 3 3 2
6 2 2 120 4 4 3
7 2 3 80 5 1 1
Here, after df.groupby([m_Cow, m_Lact]), we take the column DIM and check each row's difference from the previous row with .diff(), take the absolute value with .abs(), and then check whether it is > 7 with .gt(7), giving the code fragment ['DIM'].diff().abs().gt(7). We then group by the same grouping again with .groupby([m_Cow, m_Lact]), since this 3rd condition applies within the grouping of the first 2 conditions. In the final step we use .cumsum() on this 3rd condition, so that the count is incremented only when the 3rd condition is true.
Just in case you want to increment the count only when the DIM is increased by > 7 (e.g. 70 to 78) and exclude the case where it decreased by > 7 (e.g. 78 to 70), you can remove the .abs() part in the code above:
df['Adjusted'] = (df.groupby([m_Cow, m_Lact])
['DIM'].diff().gt(7)
.groupby([m_Cow, m_Lact])
.cumsum()+1
)
Edit (Possible simplification depending on your data sequence)
As your sample data have the main grouping keys Cow and Lact already more or less in sorted order, there is an opportunity to simplify the code further.
This differs from the sample data in the referenced question, where:
col count
0 B 1
1 B 2
2 A 1 # Value does not match previous row => reset counter to 1
3 A 2
4 A 3
5 B 1 # Value does not match previous row => reset counter to 1
Here, the B in the last row is separated from the other B's, and its count must be reset to 1 rather than continuing from the previous B's last count of 2 (to become 3). Hence, the grouping needs to compare the current row with the previous row to get the correct grouping. Otherwise, when .groupby() pulls all the B values together during processing, the count would not be correctly reset to 1 for the last entry.
If your data for the main grouping keys Cow and Lact are already naturally sorted during data construction, or have been sorted by instruction such as:
df = df.sort_values(['Cow', 'Lact'])
Then we can simplify the code as follows (when the data is already sorted by [Cow, Lact]):
df['xmast'] = df.groupby('Cow').cumcount()+1
df['Lxmast'] = df.groupby(['Cow', 'Lact']).cumcount()+1
df['Adjusted'] = (df.groupby(['Cow', 'Lact'])
['DIM'].diff().abs().gt(7)
.groupby([df['Cow'], df['Lact']])
.cumsum()+1
)
Same result and output values in the 3 columns xmast, Lxmast and Adjusted
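As a quick sanity check (a sketch, using the cowmast/df data defined above), you can build Adjusted both ways and confirm they agree:
m_Cow = (df['Cow'] != df['Cow'].shift()).cumsum()
m_Lact = (df['Lact'] != df['Lact'].shift()).cumsum()
adjusted_general = (df.groupby([m_Cow, m_Lact])['DIM'].diff().abs().gt(7)
                      .groupby([m_Cow, m_Lact]).cumsum() + 1)
adjusted_sorted = (df.groupby(['Cow', 'Lact'])['DIM'].diff().abs().gt(7)
                     .groupby([df['Cow'], df['Lact']]).cumsum() + 1)
print(adjusted_general.equals(adjusted_sorted))  # expect True for this sample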

Pandas: return the occurrences of the most frequent value for each group (possibly without apply)

Let's assume the input dataset:
test1 = [[0,7,50], [0,3,51], [0,3,45], [1,5,50],[1,0,50],[2,6,50]]
df_test = pd.DataFrame(test1, columns=['A','B','C'])
that corresponds to:
A B C
0 0 7 50
1 0 3 51
2 0 3 45
3 1 5 50
4 1 0 50
5 2 6 50
I would like to obtain a dataset grouped by 'A', together with the most common value for 'B' in each group, and the occurrences of that value:
A most_freq freq
0 3 2
1 5 1
2 6 1
I can obtain the first 2 columns with:
grouped = df_test.groupby("A")
out_df = pd.DataFrame(index=grouped.groups.keys())
out_df['most_freq'] = df_test.groupby('A')['B'].apply(lambda x: x.value_counts().idxmax())
but I am having problems with the last column.
Also: is there a faster way that doesn't involve 'apply'? This solution doesn't scale well with larger inputs (I also tried dask).
Thanks a lot!
Use SeriesGroupBy.value_counts, which sorts by default, then add DataFrame.drop_duplicates to keep the top value per group after Series.reset_index:
df = (df_test.groupby('A')['B']
.value_counts()
.rename_axis(['A','most_freq'])
.reset_index(name='freq')
.drop_duplicates('A'))
print (df)
A most_freq freq
0 0 3 2
2 1 0 1
4 2 6 1
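On pandas 1.1+ (an assumption about your pandas version), essentially the same thing can be written with DataFrame.value_counts, which also sorts by count in descending order; a sketch equivalent to the groupby version above:
df = (df_test.value_counts(subset=['A', 'B'])   # counts per (A, B) pair, sorted descending
             .reset_index(name='freq')
             .rename(columns={'B': 'most_freq'})
             .drop_duplicates('A'))              # keep the top B per A
print(df)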

Aggregating customer spend without any customer ID

I have 2 columns as below. The first column is spend, and the second column is months from offer. Unfortunately there is no ID to identify each customer. In the case below, there are three customers. e.g. The first 5 rows represent customer 1, the next 3 rows are customer 2, and then final 7 rows are customer 3. You can tell by looking at the months_from_offer, which go from -x to x months for each customer (x is not necessarily the same for each customer, as shown here where x=2,1,3 respectively for customers 1,2,3).
What I am looking to do is calculate the difference in post offer spend vs pre-offer spend for each customer. I don't care about the individual customers themselves, but I would like an overview - e.g. 10 customers had a post/pre difference in between $0-$100.
As an example with the data below, to calculate the post/pre offer difference for customer 1, it is -$10 - $32 + $23 + $54 = $35
for customer 2: -$21 + $87 = $66
for customer 3: -$12 - $83 - $65 + $80 + $67 + $11 = -$2
spend months_from_offer
$10 -2
$32 -1
$43 0
$23 1
$54 2
$21 -1
$23 0
$87 1
$12 -3
$83 -2
$65 -1
$21 0
$80 1
$67 2
$11 3
You can identify the customers using the following and then groupby customer:
import numpy as np

df['customer'] = df['months_from_offer'].cumsum().shift().eq(0).cumsum().add(1)
#Another way to calculate customer per #teylyn method
#df['customer'] = np.sign(df['months_from_offer']).diff().lt(0).cumsum().add(1)
df['amount'] = df['spend'].str[1:].astype(int) * np.sign(df['months_from_offer'])
df.groupby('customer')['amount'].sum().reset_index()
Output:
customer amount
0 1 35
1 2 66
2 3 -2
How it is done:
spend months_from_offer customer amount
0 $10 -2 1 -10
1 $32 -1 1 -32
2 $43 0 1 0
3 $23 1 1 23
4 $54 2 1 54
5 $21 -1 2 -21
6 $23 0 2 0
7 $87 1 2 87
8 $12 -3 3 -12
9 $83 -2 3 -83
10 $65 -1 3 -65
11 $21 0 3 0
12 $80 1 3 80
13 $67 2 3 67
14 $11 3 3 11
Calculate the 'customer' column using cumsum, shift and eq, then add 1 to start at customer 1.
Calculate 'amount' using string manipulation and multiplying by np.sign of 'months_from_offer'.
Sum 'amount' with groupby 'customer'.
In Excel, you can insert a helper column that looks at the sign and determines if the sign is different to the row above and then increments a counter number.
Hard code a customer ID of 1 into the first row of data, then calculate the rest.
=IF(AND(SIGN(A3)=-1,SIGN(A3)<>SIGN(A2)),B2+1,B2)
Copy the results and paste as values, then you can use them to aggregate your data
Use pandas.Series.diff with cumsum to create pseudo user id:
s = df["months_from_offer"].diff().lt(0).cumsum()
Output:
0 0
1 0
2 0
3 0
4 0
5 1
6 1
7 1
8 2
9 2
10 2
11 2
12 2
13 2
14 2
Name: months_from_offer, dtype: int64
Then use pandas.Series.clip to make the series either -1, 0, or 1, then do multiplication:
spend = (df["spend"] * df["months_from_offer"].clip(-1, 1))
Then use groupby.sum with the pseudo id s:
spend.groupby(s).sum()
Final output:
months_from_offer
0 35
1 66
2 -2
dtype: int64
Create id
s = df['months_from_offer'].iloc[::-1].cumsum().eq(0).iloc[::-1].cumsum()
0 1
1 1
2 1
3 1
4 1
5 2
6 2
7 2
8 3
9 3
10 3
11 3
12 3
13 3
14 3
Name: months_from_offer, dtype: int32
Then assign it
df['id']=s
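To finish the aggregation with this id (a sketch, assuming spend is stored as strings like '$10', as in the other answers), compute the signed spend and sum it per id:
import numpy as np

signed = df['spend'].str[1:].astype(int) * np.sign(df['months_from_offer'])
print(signed.groupby(df['id']).sum())
# For the sample data this should give 35, 66 and -2 for ids 1, 2 and 3, as in the question.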
I assume you wanted to read an Excel file using pandas.
import pandas as pd
df = pd.read_excel('file.xlsx', sheetname='yoursheet')
pre = 0
post = 0
for i in df.index:
    if df['months_from_offer'][i] < 0:
        pre += int(df['spend'][i])
    if df['months_from_offer'][i] > 0:
        post += int(df['spend'][i])
dif = post - pre
If you would like to read the data for each customer
import pandas as pd
df = pd.read_excel('file.xlsx', sheetname='yoursheet')
customers = []
last = None
pre = 0
post = 0
for i in df.index:
    if last is not None and abs(df['months_from_offer'][i] - last) > 1:
        customers.append(post - pre)
        pre = 0
        post = 0
    if df['months_from_offer'][i] < 0:
        pre += int(df['spend'][i])
    if df['months_from_offer'][i] > 0:
        post += int(df['spend'][i])
    last = df['months_from_offer'][i]
Or you can use a dict to label each customer. The way I separated the customers: when two consecutive months are more than 1 apart, another person's record must be starting.
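A minimal sketch of that dict idea, assuming spend is stored as strings like '$10' and using a drop in months_from_offer (rather than the distance test above) to detect where a new customer starts:
customers = {}          # customer number -> post - pre
cust, last = 1, None
pre = post = 0
for _, row in df.iterrows():
    m = row['months_from_offer']
    if last is not None and m < last:     # months restart, so a new customer begins
        customers[cust] = post - pre
        cust, pre, post = cust + 1, 0, 0
    spend = int(str(row['spend']).lstrip('$'))
    if m < 0:
        pre += spend
    elif m > 0:
        post += spend
    last = m
customers[cust] = post - pre
print(customers)   # for the sample data: {1: 35, 2: 66, 3: -2}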

Sum up value in different numbers of columns for each row

I have a data frame including the number of tickets sold in different price buckets for each flight.
For each record/row, I want to use the value in one column as an index in the iloc function, to sum up values across a specific number of columns.
That is, for each row, I want to sum up values from column index 5 up to the value in ['iloc_index'].
I tried df.iloc[:, 5:df['iloc_index']].sum(axis=1) but it did not work.
sample data:
A B C D iloc_value total
0 1 2 3 2 1
1 1 3 4 2 2
2 4 6 3 2 1
for each row, I want to sum up the number of columns based on the value in ['iloc_value']
for example,
for row0, I want the total to be 1+2
for row1, I want the total to be 1+3+4
for row2, I want the total to be 4+6
EDIT:
I quickly got the results this way:
First define a function that can do it for one row:
def sum_till_iloc_value(row):
    return sum(row[:row['iloc_value']+1])
Then apply it to all rows to generate your output:
df_flights['sum'] = df_flights.apply(sum_till_iloc_value, axis=1)
A B C D iloc_value sum
0 1 2 3 2 1 3
1 1 3 4 2 2 8
2 4 6 3 2 1 10
PREVIOUSLY:
Assuming you have information that looks like:
df_flights = pd.DataFrame({'flight':['f1', 'f2', 'f3'], 'business':[2,3,4], 'economy':[6,7,8]})
df_flights
flight business economy
0 f1 2 6
1 f2 3 7
2 f3 4 8
you can sum the columns you want as below:
df_flights['seat_count'] = df_flights['business'] + df_flights['economy']
This will create a new column that you can later select:
df_flights[['flight', 'seat_count']]
flight seat_count
0 f1 8
1 f2 10
2 f3 12
Here's a way to do that in a fully vectorized way: melting the dataframe, summing only the relevant columns, and getting the total back into the dataframe:
d = dict([[y, x] for x, y in enumerate(df.columns[:-1])])
temp_df = df.copy()
temp_df = temp_df.rename(columns=d)
temp_df = temp_df.reset_index().melt(id_vars = ["index", "iloc_value"])
temp_df = temp_df[temp_df.variable <= temp_df.iloc_value]
df["total"] = temp_df.groupby("index").value.sum()
The output is:
A B C D iloc_value total
0 1 2 3 2 1 3
1 1 3 4 2 2 8
2 4 6 3 2 1 10
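Another fully vectorized option (a sketch, assuming the columns appear in the order A, B, C, D followed by iloc_value, as in the sample): build a positional mask with NumPy broadcasting and sum only the masked cells:
import numpy as np

values = df[['A', 'B', 'C', 'D']].to_numpy()
mask = np.arange(values.shape[1]) <= df['iloc_value'].to_numpy()[:, None]  # keep positions 0..iloc_value
df['total'] = (values * mask).sum(axis=1)
print(df)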

Next Row Matching Criteria - python pandas dataframe

I have a data frame of roughly 6 million rows, which I need to repeatedly analyse for simulations. The following is a very simple representation of the data.
For rows where action = 1,
I am trying to devise an efficient way to do this:
For index, row in df.iterrows():
    Result = the first next row where (price2 >= row.price1 + 4) and index > row.index
or, if that doesn't exist,
return index + 100 (i.e. the activity times out).
import pandas as pd
df = pd.DataFrame({'Action(y/n)' : [0,1,0,0,1,0,1,0,0,0], 'Price1' : [1,8,3,1,7,3,8,2,3,1], 'Price2' : [2,1,1,5,3,1,2,11,12,1]})
print(df)
Action(y/n) Price1 Price2
0 0 1 2
1 1 8 1
2 0 3 1
3 0 1 5
4 1 7 3
5 0 3 1
6 1 8 2
7 0 2 11
8 0 3 12
9 0 1 1
Resulting in something like this:
Action(y/n) Price1 Price2 ExitRow(IndexOfRowWhereCriteriaMet)
0 0 14 2 9
1 1 8 1 8
2 0 3 1 102
3 0 1 5 103
4 1 7 3 7
5 0 3 1 105
6 1 8 2 8
7 0 2 11 107
8 0 3 12 108
9 0 1 1 109
I have tried a few methods, which are all really slow.
This best one maps it, but it is really not fast enough.
df['ExitRow'] = list(map(ATestFcn, df.index, df.Price1))

def ATestFcn(dfIx, dfPrice1):
    ExitRow = df[((df.Price2 > (dfPrice1 + 4)) & (df.index > dfIx) & (df.index <= dfIx + TimeOut))].index.min()
    if pd.isnull(ExitRow):
        return dfIx + TimeOut
    else:
        return ExitRow
I also tested this with a loop; it was about 25% slower, but essentially the same idea.
I'm thinking there must be a smarter or faster way to do this. A mask could have been useful, except you can't fill down with this data, as the matching price2 for one row might be thousands of rows after the match for another row, and I can't find a way to turn a merge into a cross apply like one might in T-SQL.
To find the index of the first row which meets your criterion, you could use
cur_row_idx = 100 # we want the row after 100
next_row_idx = (df[cur_row_idx:].Price2 >= df[cur_row_idx:].Price1 + 4).argmax()
Then, you want to set a cutoff, say, the max value you can get is cur_row_idx + TimeOut, so it could be:
next_row_idx = np.min(((df[cur_row_idx:].Price2 >= df[cur_row_idx:].Price1 + 4).argmax(), cur_row_idx + TimeOut))
I did not check the performance on the large datasets, but hope it helps.
If you wish, you can also wrap it into a function:
def ATestFcn(dfIx, df, TimeOut):
    return np.min(((df[dfIx:].Price2 >= df[dfIx:].Price1 + 4).argmax(), dfIx + TimeOut))
Edit: Just tested it, it is quite fast, see the results below:
df = pd.DataFrame()
Price1 = np.random.randint(100, size=10_000_000)
Price2 = np.random.randint(100, size=10_000_000)
df["Price1"] = Price1
df["Price2"] = Price2
timeit ATestFcn(np.random.randint(1e6), df, 100)
Out[62]: 1 loops, best of 3: 289 ms per loop
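If you only need the result for rows where an action occurred, one way to drive the helper (a sketch, assuming the sample df from the question and TimeOut = 100) is:
# Note: this relies on the older argmax-as-idxmax behaviour used above;
# on recent pandas you may need .idxmax() instead of .argmax() inside ATestFcn.
TimeOut = 100
action_rows = df.index[df['Action(y/n)'] == 1]
df.loc[action_rows, 'ExitRow'] = [ATestFcn(i, df, TimeOut) for i in action_rows]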
