Nested for-loop optimization while iterating over DataFrames - Python

I am fairly new to Python and coding. I am looking for a way to optimize a nested for loop.
The nested for loop I have written works perfectly fine, but it takes a lot of time to run.
I have explained the basic idea behind my original code and what I have tried to do below:
import pandas as pd

data = [['a', '35-44', 'male', ['b', 'z', 'x']],
        ['b', '15-24', 'female', ['a', 'z', 'q']],
        ['r', '35-44', 'male', ['z', 'a', 'd']],
        ['q', '15-24', 'female', ['u', 'k', 'b']]]
df = pd.DataFrame(data, columns=['ID', 'age_group', 'gender', 'matching_ids'])
df is the DataFrame that I am working on.
What I want to do is compare each 'ID' in df with every other 'ID' in the same df and check whether the following conditions hold:
The age_group is the same.
The gender is the same.
The 'ID' is in the other row's 'matching_ids'.
If these conditions are met, I need to append that row to a separate DataFrame (sample_df).
This is the code with the nested for loop that works fine:
df_copy = df.copy()
sample_df = pd.DataFrame()

for i in range(len(df)):
    for j in range(len(df)):
        if (i != j) and (df.iloc[i]['ID'] in df_copy.iloc[j]['matching_ids']) and \
           (df.iloc[i]['gender'] == df_copy.iloc[j]['gender']) and \
           (df.iloc[i]['age_group'] == df_copy.iloc[j]['age_group']):
            sample_df = sample_df.append(df_copy.iloc[[j]])
I tried simplifying it by writing a function and using df.apply(func), but it still takes almost the same amount of time.
Below is the code written using a function:
sample_df_func = pd.DataFrame()

def func_extract(x):
    global sample_df_func
    for k in range(len(df)):
        if (x['ID'] != df_copy.iloc[k]['ID']) and (x['ID'] in df_copy.iloc[k]['matching_ids']) and \
           (x['gender'] == df_copy.iloc[k]['gender']) and \
           (x['age_group'] == df_copy.iloc[k]['age_group']):
            sample_df_func = sample_df_func.append(df_copy.iloc[[k]])

df.apply(func_extract, axis=1)
sample_df_func
I am looking for ways to simplify this and optimize it further.
Forgive me if the solution to this is very simple and I am not able to figure it out.
Thanks
PS: I've just started coding 2 months back.

We can form groups over age_group and gender to obtain subsets where the first two conditions hold automatically. For the third condition, we can explode the matching_ids and check whether any of the exploded ids appears among the group's IDs, keeping only the matching rows within each group via boolean indexing:
out = (df.groupby(["age_group", "gender"])
         .apply(lambda s: s[s.matching_ids.explode().isin(s.ID).groupby(level=0).any()])
         .reset_index(drop=True))
where we lastly reset the index to get rid of the grouping variables in the index, to get
>>> out
  ID age_group  gender matching_ids
0  b     15-24  female    [a, z, q]
1  q     15-24  female    [u, k, b]
2  r     35-44    male    [z, a, d]
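If groupby.apply turns out to be slow on a larger frame, a merge-based sketch is one alternative (assuming pandas >= 0.25 for explode, and that an ID never lists itself in matching_ids, which stands in for the i != j check):
# Explode matching_ids into one row per (row, candidate id) pair, then
# inner-join against the frame's own IDs on the same age_group/gender.
# Rows whose ID survives the join satisfy all three conditions.
pairs = df.explode("matching_ids").rename(columns={"matching_ids": "match_id"})
ids = df[["ID", "age_group", "gender"]].rename(columns={"ID": "match_id"})
hits = pairs.merge(ids, on=["match_id", "age_group", "gender"])
out = df[df["ID"].isin(hits["ID"])].reset_index(drop=True)  # same rows, in df order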

Related

How To Assign Different Column Values To Different Variables In Python

I am trying to assign all three unique groups from the group column in df to different variables (see my code) using Python. How do I incorporate this inside a for loop? Obviously var + i does not work.
import pandas as pd

data = {
    'group': ['a', 'a', 'a', 'b', 'b', 'c', 'c'],
    'num': list(range(7))
}
df = pd.DataFrame(data)

unique_groups = df['group'].unique()

# How do I incorporate this logic inside a for loop??
var1 = df[df['group'] == unique_groups[0]]
var2 = df[df['group'] == unique_groups[1]]
var3 = df[df['group'] == unique_groups[2]]

# My approach:
for i in range(len(unique_groups)):
    var + i = df[df['group'] == unique_groups[i]]  # obviously "var + i" does not work
From your comment it seems it is okay for all_vars to be a list so that all_vars[0] is the first group, all_vars[1] the second, etc. In that case, consider using groupby instead:
all_vars = [group for name, group in df.groupby("group")]
You can do this using a dictionary, basically:
all_vars = {}
for i in range(len(unique_groups)):
    all_vars[f"var{i}"] = df[df['group'] == unique_groups[i]]

Function to replace values in columns with Column Headers (Pandas)

I am trying to create a function that loops through specific columns in a dataframe and replaces the values with the column names. I have tried the below but it does not change the values in the columns.
def value_replacer(df):
    cols = ['Account Name', 'Account Number', 'Maintenance Contract']
    x = [i for i in df.columns if i not in cols]
    for i in x:
        for j in df[i]:
            if isinstance(j, str):
                j.replace(j, i)
    return df
What should be added to the function to change the values?
(The original function fails because strings are immutable: j.replace(j, i) returns a new string that is immediately discarded, so nothing is written back to the DataFrame.)
Similar to #lazy's solution, but using difference to get the unlisted columns and using a mask instead of the list comprehension:
df = pd.DataFrame({'w': ['a', 'b', 'c'], 'x': ['d', 'e', 'f'], 'y': [1, 2, '3'], 'z': [4, 5, 6]})

def value_replacer(df):
    cols_to_skip = ['w', 'z']
    for col in df.columns.difference(cols_to_skip):
        mask = df[col].map(lambda x: isinstance(x, str))
        df.loc[mask, col] = col
    return df
Output:
   w  x  y  z
0  a  x  1  4
1  b  x  2  5
2  c  x  y  6
Loop through only the columns of interest once, and only evaluate each row within each column to see if it is a string or not, then use the resulting mask to bulk update all strings with the column name.
Note that this will change the DataFrame in place, so make a copy if you want the original, and you don't necessarily need the return statement.
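For instance, a minimal usage sketch that keeps the original intact (variable names are illustrative):
# value_replacer mutates its argument, so hand it a copy.
replaced = value_replacer(df.copy())
print(replaced)  # strings in 'x' and 'y' replaced by the column name
print(df)        # the original frame is unchanged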

Going through the same logic by order

I have a piece of code as below:
a = df[['col1', 'col2_1', 'col2_2', 'col2_3', 'col3']]
a_indices = np.argmax(a.ne(0).values, axis=1)
a_df = pd.DataFrame(a.values[np.arange(len(a)), a_indices])

b = df[['col2_1', 'col2_2', 'col2_3', 'col3', 'col1']]
b_indices = np.argmax(b.ne(0).values, axis=1)
b_df = pd.DataFrame(b.values[np.arange(len(b)), b_indices])
....
This code is repetitive, and I am hoping to loop through it. The idea is to have all the combinations of different orders of col1, col2 (col2_1, col2_2, col2_3), and col3. The return should be a combined DataFrame of a_df and b_df.
Note: col2_1, col2_2, and col2_3 can be in different orders, but they always stay next to each other. Is there any way to make this piece of code simpler?
What you can do first is define the number of iterations to loop over. Here you have 5 columns to loop over.
list_columns = ['col1', 'col2_1', 'col2_2', 'col2_3', 'col3']
print(len(list_columns)) # returns 5
Then, you can define your column names based on what you want to put in your DataFrame. Suppose you have 5 iterations to make; your column names would be ['A', 'B', 'C', 'D', 'E']. This is the columns argument of your DataFrame. An easier way to concatenate several columns at once is to create a dictionary first, with each column name being a key whose value is a list of the same size.
list_columns = ['col1', 'col2_1', 'col2_2', 'col2_3', 'col3']
new_columns = ['A', 'B', 'C', 'D', 'E']

# Use a dictionary comprehension in my case
data_dict = {column: [] for column in new_columns}

n = 50  # Assume the number of loops is arbitrary here
for i in range(n):
    for col in new_columns:
        # do something
        data_dict[col].append(something)
In your case it looks like you can directly operate on the lists by providing a NumPy array instead. Therefore:
list_cols = ['col1', 'col2_1', 'col2_2', 'col2_3', 'col3']
new_cols = ['A', 'B', 'C', 'D', 'E']

data_df = {}
for i, (col, new_col) in enumerate(zip(list_cols, new_cols)):
    print(col, list_cols[0:i] + list_cols[i+1:])
    temp_df = df[[col] + list_cols[0:i] + list_cols[i+1:]]
    temp_indices = np.argmax(temp_df.ne(0).values, axis=1)
    data_df[new_col] = temp_df.values[np.arange(len(temp_df)), temp_indices]

final_df = pd.DataFrame(data_df)
What I basically did was a double unpacking, combining enumerate to get the index and zip to pair the old and new column names. For each iteration, the selected column is placed in front of the remaining columns, which keep their original relative order.
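Note that this only rotates each column to the front. If you really need every ordering in which the col2_* block stays contiguous (as the question's note asks), a sketch using itertools.permutations could generate the column orders first; the column names are taken from the question:
from itertools import permutations

# Permute both the three blocks and the col2 block's interior:
# 3! * 3! = 36 layouts in total.
col2_block = ['col2_1', 'col2_2', 'col2_3']
orders = []
for blocks in permutations([['col1'], col2_block, ['col3']]):
    for inner in permutations(col2_block):
        order = [c for block in blocks
                 for c in (inner if block is col2_block else block)]
        orders.append(order)
print(len(orders))  # 36
Each generated order can then be fed through the same ne(0)/argmax logic as above.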

Python: compare two DataFrames of different sizes

I have a problem comparing two DataFrames of different sizes and showing which values match and which do not.
Both DataFrames contain countries. One lists all countries in the world (country_data_df), while the other contains only some countries (country_list_df).
Here is the structure of both DataFrames:
Index  Country
0      Afghanistan
..     ..
Another problem is how to handle partial matches, e.g. via a contains-style comparison of 'Venezuela (Bolivarian Republic of)' vs 'Venezuela'.
Here is my code snippet.
seen_countries = []
unseen_countries = []

for a in country_list_df:
    if a in country_data_df:
        seen_countries.append(a)
    else:
        unseen_countries.append(a)
How can I solve this?
Clean your data
The 2nd part of your question deals with comparing dissimilar values in your data. The easiest thing to do would be to standardize the country names in your list of all countries to the values in your data. It's much easier to clean the smaller, more finite list of countries for reuse against your larger input data set.
Once your country list has values that can be compared to your input data, do the following:
clean_data standardizes the values to lowercase and puts them into a set, which automatically gives you unique values.
seen_countries is created by applying clean_data to the country column of your input data set.
unseen_countries is simply the set of all countries in country_list minus the seen_countries set.
#!/usr/bin/env python
import pandas as pd

def clean_data(x):
    return set(v.lower() for v in x)

if __name__ == "__main__":
    country_data = ["C", "D", "E", "F", "a", "A"]
    country_list = ["a", "b", "c", "d", "e", "f", "g"]

    country_list_df = pd.DataFrame(country_list, columns=["Country"])
    country_data_df = pd.DataFrame(country_data, columns=["Country"])

    seen_countries = clean_data(country_data_df.Country)
    unseen_countries = clean_data(country_list_df.Country) - seen_countries

    print("__Seen Countries__")
    print(seen_countries)
    print("__Unseen Countries__")
    print(unseen_countries)
Output:
__Seen Countries__
{'c', 'a', 'd', 'f', 'e'}
__Unseen Countries__
{'g', 'b'}
Have you tried using Pandas isin? It is great for comparing DataFrames, even if they are different sizes.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
Example:
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f'], 'C': ['Z', 'V', 'W']})
other = pd.DataFrame({'A': [1, 3, 3, 2], 'B': ['e', 'f', 'f', 'e']})
print(df.isin(other))
results in:
       A      B      C
0   True  False  False
1  False  False  False
2   True   True  False
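Applied to the countries question, Series.isin gives the seen/unseen split directly (the column name 'Country' is assumed from the question's structure):
# Rows of the short list that do / do not appear in the full country data.
mask = country_list_df["Country"].isin(country_data_df["Country"])
seen_countries = country_list_df.loc[mask, "Country"].tolist()
unseen_countries = country_list_df.loc[~mask, "Country"].tolist()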

Deleting duplicate entries from CSV

I have a csv with 2 columns:
a,x
a,y
a,z
b,1
b,2
b,3
b,4
c,5
c,6
c,7
c,8
I'd like to loop through it, only looking at the 1st column, and keep only 2 entries for each distinct value in the first column. I don't care which values get kept or deleted for the second column; I just want 2 entries for each distinct value in the first column.
Output would look something like this:
a,x
a,y
b,1
b,2
c,5
c,6
I'm familiar with the csv module (how to read/write/replace), but am having a hard time finding resources that explain how to compare one row with another. I think that is where I'm stuck on this problem.
I would use a dictionary to combat this problem, maybe something along the lines of the following:
entries = {}  # renamed from "dict" so the builtin is not shadowed
rows = [['a', 'x'], ['a', 'y'], ['a', 'z'], ['b', 1], ['b', 2], ['b', 3], ['b', 4],
        ['c', 5], ['c', 6], ['c', 7], ['c', 8]]

for row in rows:
    if row[0] not in entries:
        entries[row[0]] = []
    if len(entries[row[0]]) == 2:
        continue
    entries[row[0]].append(row[1])

print(entries)
Output:
>> {'a': ['x', 'y'], 'b': [1, 2], 'c': [5, 6]}
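To get from that mapping back to the two-rows-per-key CSV shape, one possible sketch (the output file name is illustrative):
import csv

# Rebuild rows like ['a', 'x'] from the mapping and write them out.
with open("deduped.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for key, vals in entries.items():
        for val in vals:
            writer.writerow([key, val])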
So, here's an idea, based on Jacob's (a sketch is shown after this list):
Create two dicts, first and second.
For each line in the CSV:
if the key is in second, skip; else
if the key is not in first, put it there;
if the key is in first and the value is not the line you are looking at, add the key to second.
At the end you'll have two dictionaries with one value each, as you wanted.
You could generalize it to keeping N values by creating a list of dictionaries and using as many as you need.
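A minimal sketch of that idea, assuming the key is the first CSV column:
import csv

first, second = {}, {}
kept = []                         # rows to keep, at most two per key
with open("test.csv", newline="") as f:
    for row in csv.reader(f):
        key = row[0]
        if key in second:
            continue              # already kept two rows for this key
        if key not in first:
            first[key] = row      # first row seen for this key
        elif first[key] != row:
            second[key] = row     # a second, different row
        else:
            continue              # exact duplicate of the first row
        kept.append(row)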
Here's an example with itertools.groupby:
import itertools

with open("test.csv", "r") as stuff:
    data = stuff.readlines()

out = []
for k, dat in itertools.groupby(data, key=lambda x: x[0]):
    twoVals = list(dat)[:2]
    out.append(twoVals)

print(out)
For cases where there are fewer than two values:
import itertools

with open("test.csv", "r") as stuff:
    data = stuff.readlines()

out = []
for k, dat in itertools.groupby(data, key=lambda x: x[0]):
    dat = list(dat)
    try:
        vals = dat[:2]
    except IndexError:
        vals = list(dat)
    out.append(vals)

print(out)
I tested this out on:
a,x
a,y
a,z
b,1
b,2
b,3
b,4
c,5
c,6
c,7
c,8
z,1
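Two caveats on the groupby examples above: list slicing never raises IndexError, so dat[:2] alone already copes with groups of fewer than two values and the try/except is redundant; and itertools.groupby only groups consecutive lines, so the file must be sorted by key, while key=lambda x: x[0] looks only at the first character of each raw line. For multi-character keys, a sketch with real CSV parsing:
import csv
import itertools

# Assumes test.csv is already sorted by its first column.
with open("test.csv", newline="") as f:
    rows = list(csv.reader(f))

out = [list(group)[:2] for _, group in itertools.groupby(rows, key=lambda r: r[0])]
print(out)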
