Fast way to Loop through date by groups

I need to calculate the opening balance for each ID from the ending balance of the previous day.
I tried the code below, which loops through each ID and each date, but it is too slow (I have 0.5 million records to deal with). Is there a better way to do this?
I suspect the last two lines of my code are not very efficient.
id_list = df['ID'].unique().tolist()
date_list = df['Date'].unique().tolist()
for t in id_list:
    k = 0
    for d in date_list:
        print(t, d)
        df.loc[(df['ID'] == t) & (df['Date'] == d), ['Opening Bal']] = k
        k = df.loc[(df['ID'] == t) & (df['Date'] == d), ['Ending Bal']]

import pandas as pd
import numpy as np
df = pd.DataFrame({'Date': ['01/01/2023', '01/02/2023', '01/03/2023', '01/01/2023', '01/02/2023', '01/03/2023'],
                   'ID': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'Ending Bal': [0.56, 0.73, 1.09, 0.34, 0.86, 1.83]})
# Sort by ID, then by Date within each ID (two separate sorts would not reliably keep the date order)
df = df.sort_values(by=['ID', 'Date'])
temp = []
for i in df['ID'].unique():
    # The first row of each ID has no previous day, so its opening balance is NaN
    temp = temp + [np.nan] + list(df[df['ID'] == i]['Ending Bal'])[:-1]
df['Opening Bal'] = temp
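A loop-free variant may be worth trying on the full data. The sketch below assumes each ID has at most one row per date and that the dates are in month/day/year format (neither is stated in the question). Under those assumptions the opening balance is simply the previous row's ending balance within each ID, which groupby combined with shift computes in one vectorized step. The first day of each ID is set to 0 here to mirror k = 0 in the question; drop the fillna call to keep NaN instead.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['01/01/2023', '01/02/2023', '01/03/2023',
             '01/01/2023', '01/02/2023', '01/03/2023'],
    'ID': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Ending Bal': [0.56, 0.73, 1.09, 0.34, 0.86, 1.83],
})

# Assumption: dates are month/day/year; parse them so sorting is chronological
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')

# Order rows by ID, then by date within each ID
df = df.sort_values(['ID', 'Date']).reset_index(drop=True)

# Previous row's ending balance within each ID; the first row of each ID gets 0
df['Opening Bal'] = df.groupby('ID')['Ending Bal'].shift().fillna(0)

print(df)
```

Because shift works on whole columns at once, the cost no longer grows with the number of (ID, date) pairs the way the nested loop does.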

Related

How To Assign Different Column Values To Different Variables In Python

I am trying to assign all the three unique groups from the group column in df to different variables (see my code) using Python. How do I incorporate this inside a for loop? Obviously var + i does not work.
import pandas as pd
data = {
    'group': ['a', 'a', 'a', 'b', 'b', 'c', 'c'],
    'num': list(range(7))
}
df = pd.DataFrame(data)
unique_groups = df['group'].unique()
# How do I incorporate this logic inside a for loop??
var1 = df[df['group'] == unique_groups[0]]
var2 = df[df['group'] == unique_groups[1]]
var3 = df[df['group'] == unique_groups[2]]
# My approach:
for i in range(len(unique_groups)):
    var + i = df[df['group'] == unique_groups[i]] # obviously "var + i" does not work

From your comment it seems it is okay for all_vars to be a list so that all_vars[0] is the first group, all_vars[1] the second, etc. In that case, consider using groupby instead:
all_vars = [group for name, group in df.groupby("group")]

You can do this using a dictionary, basically:
all_vars = {}
for i in range(len(unique_groups)):
    all_vars[f"var{i}"] = df[df['group'] == unique_groups[i]]
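The two ideas above can also be combined. As a minimal sketch, a dictionary comprehension over groupby gives one sub-frame per group, keyed by the group's own label rather than by a generated name like var0. The name groups_by_name is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['a', 'a', 'a', 'b', 'b', 'c', 'c'],
    'num': list(range(7)),
})

# groupby yields (group_label, sub_frame) pairs, so no manual filtering is needed
groups_by_name = {name: group for name, group in df.groupby('group')}

# Look a group up by its label instead of by position
print(groups_by_name['b'])
```

Keying by label means the lookup stays correct even if the order of groups in the data changes.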

Nested for-loop optimization while iterating over Dataframes

I am fairly new to python and coding. I am looking for a way to optimize a nested for loop.
The nested for loop I have written works perfectly fine, but it takes a lot of time to run.
I have explained the basic idea behind my original code and what I have tried to do, below:
import pandas as pd
data = [['a', '35-44', 'male', ['b', 'z', 'x']],
        ['b', '15-24', 'female', ['a', 'z', 'q']],
        ['r', '35-44', 'male', ['z', 'a', 'd']],
        ['q', '15-24', 'female', ['u', 'k', 'b']]]
df = pd.DataFrame(data, columns=['ID', 'age_group', 'gender', 'matching_ids'])
df is the DataFrame that I am working on.
What I want to do is compare each 'ID' in df with every other 'ID' in the same df and check whether the pair meets these conditions:
The age_group is equal.
The gender is the same.
The 'ID' is in the other row's 'matching_ids'.
If these conditions are met, I need to append that row to a separate DataFrame (sample_df).
This is the code with the nested for loop that works fine:
df_copy = df.copy()
sample_df = pd.DataFrame()
for i in range(len(df)):
    for j in range(len(df)):
        if (i != j) and (df.iloc[i]['ID'] in df_copy.iloc[j]['matching_ids']) and \
           (df.iloc[i]['gender'] == df_copy.iloc[j]['gender']) and \
           (df.iloc[i]['age_group'] == df_copy.iloc[j]['age_group']):
            # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
            sample_df = pd.concat([sample_df, df_copy.iloc[[j]]])
I tried simplifying it by writing a function and using df.apply(func), but it still takes almost the same amount of time.
Below is the code written using a function:
sample_df_func = pd.DataFrame()
def func_extract(x):
    global sample_df_func
    for k in range(len(df)):
        if (x['ID'] != df_copy.iloc[k]['ID']) and (x['ID'] in df_copy.iloc[k]['matching_ids']) and \
           (x['gender'] == df_copy.iloc[k]['gender']) and \
           (x['age_group'] == df_copy.iloc[k]['age_group']):
            # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
            sample_df_func = pd.concat([sample_df_func, df_copy.iloc[[k]]])
df.apply(func_extract, axis=1)
sample_df_func
I am looking for ways to simplify this and optimize it further.
Forgive me, if the solution to this is very simple and I am not able to figure it out.
Thanks
PS: I've just started coding 2 months back.

We can group by age_group and gender to obtain subsets in which the first two conditions hold automatically. For the third condition, we can explode matching_ids, check whether any of the exploded ids is in the group's ID column (isin), and keep only those rows within each group using boolean indexing:
out = (df.groupby(["age_group", "gender"])
         .apply(lambda s: s[s.matching_ids.explode().isin(s.ID).groupby(level=0).any()])
         .reset_index(drop=True))
where lastly we reset the index to get rid of the grouping variables as index, to get
>>> out
  ID age_group  gender matching_ids
0  b     15-24  female    [a, z, q]
1  q     15-24  female    [u, k, b]
2  r     35-44    male    [z, a, d]
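For comparison, the same three conditions can be expressed as a join instead of a per-group apply. This is only a sketch of an alternative, not part of the answer above: it explodes matching_ids into one row per candidate id, then inner-joins against the (ID, age_group, gender) triples so that only candidates with the same age group and gender survive. The names exploded, candidates and pairs are illustrative, and each matching row is kept once, in its original order.

```python
import pandas as pd

data = [['a', '35-44', 'male', ['b', 'z', 'x']],
        ['b', '15-24', 'female', ['a', 'z', 'q']],
        ['r', '35-44', 'male', ['z', 'a', 'd']],
        ['q', '15-24', 'female', ['u', 'k', 'b']]]
df = pd.DataFrame(data, columns=['ID', 'age_group', 'gender', 'matching_ids'])

# One row per (ID, candidate id) pair
exploded = df.explode('matching_ids')

# Rename ID so it can be joined against the exploded candidate ids
candidates = df[['ID', 'age_group', 'gender']].rename(columns={'ID': 'matching_ids'})

# Inner join keeps pairs whose candidate exists with the same age_group and gender
pairs = exploded.merge(candidates, on=['matching_ids', 'age_group', 'gender'])

# Ignore rows that list their own ID
pairs = pairs[pairs['ID'] != pairs['matching_ids']]

# Keep each matching row once, in the original row order
out = df[df['ID'].isin(pairs['ID'])].reset_index(drop=True)
print(out)
```

The join avoids calling a Python function per group, which may matter when there are many small groups.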

Going through the same logic by order

I have a piece of code as below:
a = df[['col1', 'col2_1', 'col2_2', 'col2_3', 'col3']]
a_indices = np.argmax(a.ne(0).values, axis=1)
a_df = pd.DataFrame(a.values[np.arange(len(a)), a_indices])
b = df[['col2_1', 'col2_2', 'col2_3', 'col3', 'col1']]
b_indices = np.argmax(b.ne(0).values, axis=1)
b_df = pd.DataFrame(b.values[np.arange(len(b)), b_indices])
....
This code is repetitive, and I am hoping to loop through the variants instead. The idea is to cover all combinations of the different orders of col1, col2 (col2_1, col2_2, col2_3), and col3. The result should be a combined DataFrame of a_df and b_df.
Note: col2_1, col2_2, and col2_3 can appear in different orders, but they always stay next to each other. Is there any way to make this piece of code simpler?

First, define the maximum number of iterations to loop over. Here you have 5 columns to loop over.
list_columns = ['col1', 'col2_1', 'col2_2', 'col2_3', 'col3']
print(len(list_columns)) # returns 5
Then, you can define your column names based on what you want to put in your DataFrame. Suppose you have 5 iterations to make; your column names would then be ['A', 'B', 'C', 'D', 'E'], which is the columns argument of your DataFrame. An easier way to concatenate several columns at once is to first create a dictionary in which each column name is a key and each value is a list, with all lists the same length.
list_columns = ['col1', 'col2_1', 'col2_2', 'col2_3', 'col3']
new_columns = ['A', 'B', 'C', 'D', 'E']
# Use a dictionary comprehension in my case
data_dict = {column: [] for column in new_columns}
n = 50 # Assume the number of loops is arbitrary there
for i in range(n):
    for col in new_columns:
        # do something ("something" is a placeholder for the computed value)
        data_dict[col].append(something)
In your case it looks like you can directly operate on the lists by providing a NumPy array instead. Therefore:
list_cols = ['col1', 'col2_1', 'col2_2', 'col2_3', 'col3']
new_cols = ['A', 'B', 'C', 'D', 'E']
data_df = {}
for i, (col, new_col) in enumerate(zip(list_cols, new_cols)):
    print(col, list_cols[0:i] + list_cols[i+1:])
    # Put the current column first, followed by the remaining columns
    temp_df = df[[col] + list_cols[0:i] + list_cols[i+1:]]
    temp_indices = np.argmax(temp_df.ne(0).values, axis=1)
    # Index temp_df (not b) so the values match the reordered columns
    data_df[new_col] = temp_df.values[np.arange(len(temp_df)), temp_indices]
final_df = pd.DataFrame(data_df)
What I basically did was a double unpacking, combining enumerate to get the index and zip to pair each column with its new name. Each column is selected in turn and placed before the rest of the list, which keeps its original relative order.
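If the goal is literally every ordering in which the col2 columns stay adjacent, itertools.permutations can generate the orders. The following is a sketch under assumptions: the sample DataFrame is hypothetical (the question does not show its data), first_nonzero is an illustrative helper name, and result columns are labelled by joining the column order with " > ". With three blocks and three col2 columns this produces 3! x 3! = 36 orders.

```python
from itertools import permutations

import numpy as np
import pandas as pd

# Hypothetical sample data; the question does not show its DataFrame
df = pd.DataFrame({
    'col1':   [0, 5, 0],
    'col2_1': [1, 0, 0],
    'col2_2': [2, 0, 0],
    'col2_3': [0, 3, 0],
    'col3':   [4, 6, 7],
})

def first_nonzero(frame):
    """Return, per row, the first non-zero value in column order (0 if none)."""
    values = frame.to_numpy()
    idx = np.argmax(values != 0, axis=1)
    return values[np.arange(len(values)), idx]

col2_block = ['col2_1', 'col2_2', 'col2_3']
results = {}
for inner in permutations(col2_block):
    # Treat the col2 columns as one block so they always stay adjacent
    blocks = [['col1'], list(inner), ['col3']]
    for block_order in permutations(blocks):
        order = [c for block in block_order for c in block]
        results[' > '.join(order)] = first_nonzero(df[order])

final_df = pd.DataFrame(results)
print(final_df.shape)
```

Each column of final_df then corresponds to one ordering, so a_df and b_df from the question are two of its columns.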

Query data frame using list of columns and list of values

I have the following simple data frame:
import pandas as pd
stores = ['a', 'a', 'b', 'b', 'b']
brands = ['Nike', 'Nike', 'Adidas', 'Nike', 'Adidas']
colours = ['Black', 'Black', 'White', 'Black', 'Black']
data = dict(stores=stores, brands=brands, colours=colours)
df = pd.DataFrame(data, columns=data.keys())
I'd like to query this using a list of columns and a corresponding list of values. For example:
columns = ['stores', 'brands']
values = ['a', 'Nike']
df[columns == values]
Is this possible?

This should be possible using numpy.logical_and with reduce for an arbitrary number of conditions:
import numpy as np
df[np.logical_and.reduce([df[col] == val for col, val in zip(columns, values)])]
Results:
  stores brands colours
0      a   Nike   Black
1      a   Nike   Black

You can do this in a similar way to thesilkworm's answer using only pandas:
query = " & ".join([c + " == '" + v + "'" for c,v in zip(columns, values)])
df.query(query)
Output using the above code:
>>> query = " & ".join([c + " == '" + v + "'" for c,v in zip(columns, values)])
>>> query
"stores == 'a' & brands == 'Nike'"
>>> df.query(query)
  stores brands colours
0      a   Nike   Black
1      a   Nike   Black
Note the inclusion of single quotes around v in the list comprehension. These are important, since we're comparing a string value. For more info, see the query documentation for pandas.
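Because the query string is assembled by hand, it depends on the values being strings without embedded quotes. A sketch of a third option that sidesteps string building: wrap the values in a Series indexed by the column names, compare it against the selected columns, and keep rows where every comparison is true. The names wanted and mask are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'stores': ['a', 'a', 'b', 'b', 'b'],
    'brands': ['Nike', 'Nike', 'Adidas', 'Nike', 'Adidas'],
    'colours': ['Black', 'Black', 'White', 'Black', 'Black'],
})
columns = ['stores', 'brands']
values = ['a', 'Nike']

# Label each value with its column so the comparison aligns by column name
wanted = pd.Series(values, index=columns)

# True only for rows where every selected column equals its wanted value
mask = df[columns].eq(wanted).all(axis=1)

result = df[mask]
print(result)
```

Since no expression string is parsed, this form should also work unchanged for numeric values.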

dask delayed loop with tuples

How can I properly use dask delayed for a group-wise quotient calculation over multiple columns?
Some sample data:
import pandas as pd
raw_data = {
    'subject_id': ['1', '2', '3', '4', '5'],
    'name': ['A', 'B', 'C', 'D', 'E'],
    'nationality': ['DE', 'AUT', 'US', 'US', 'US'],
    'alotdifferent': ['x', 'y', 'z', 'x', 'a'],
    'target': [0, 0, 0, 1, 1],
    'age_group': [1, 2, 1, 3, 1]}
df_a = pd.DataFrame(raw_data, columns=['subject_id', 'name', 'nationality', 'alotdifferent', 'target', 'age_group'])
df_a.nationality = df_a.nationality.astype('category')
df_a.alotdifferent = df_a.alotdifferent.astype('category')
df_a.name = df_a.name.astype('category')
Some setup code which determines the string / categorical columns:
FACTOR_FIELDS = df_a.select_dtypes(include=['category']).columns
columnsToDrop = ['alotdifferent']
columnsToBias_keep = FACTOR_FIELDS[~FACTOR_FIELDS.isin(columnsToDrop)]
target = 'target'
The main part, the calculation of the group-wise quotients:
def compute_weights(da, colname):
    # group only a single time
    grouped = da.groupby([colname, target]).size()
    # calculate first ratio
    df = grouped / da[target].sum()
    nameCol = "pre_" + colname
    grouped_res = df.reset_index(name=nameCol)
    grouped_res = grouped_res[grouped_res[target] == 1]
    # the positional axis argument of drop was removed in pandas 2.0
    grouped_res = grouped_res.drop(columns=target)
    # todo persist the result in dict for transformer
    result_1 = grouped_res
    return result_1, nameCol
And now actually calling it on multiple columns:
original = df_a.copy()
output_df = original
ratio_weights = {}
for colname in columnsToBias_keep.union(columnsToDrop):
    # compute_weights as shown returns two values: the result frame and its column name
    result_1, nameCol = compute_weights(original, colname)
    # persist the result in dict for transformer
    # this is required to separate fit and transform stage (later on in a sklearn transformer)
    ratio_weights[nameCol] = result_1
When trying to use dask delayed, I need to call compute, which breaks the DAG. How can I circumvent this in order to create a single big computational graph which is calculated in parallel?
compute_weights = delayed(compute_weights)
a,b = delayed_res_name.compute()
ratio_weights = {}
ratio_weights[b] = a
