How can I improve the performance of this nested loop? - python

I'm working on a matching problem where I have to assign students to schools. The catch is that I have to consider each student's siblings, since siblings are a relevant feature when setting priorities for each school.
My data looks like the one below.
Index Student_ID Brothers
0 92713846 [79732346]
1 69095898 [83462239]
2 67668672 [75788479, 56655021, 75869616]
3 83396441 []
4 65657616 [52821691]
5 62399116 []
6 78570850 [62046889, 63029349]
7 69185379 [70285250, 78819847, 78330994]
8 66874272 []
9 78624173 [73902609, 99802441, 95706649]
10 97134369 []
11 77358607 [52492909, 59830215, 71251829]
12 56314554 [54345813, 71451741]
13 97724180 [64626337]
14 73480196 [84454182, 90435785]
15 70717221 [60965551, 98620966, 70969443]
16 60942420 [54370313, 63581164, 72976764]
17 81882157 [78787923]
18 73387623 [87909970, 57105395]
19 59115621 [62494654]
20 54650043 [69308874, 88206688]
21 53368352 [63191962, 53031183]
22 76024585 [61392497]
23 84337377 [58419239, 96762668]
24 50099636 [80373936, 54314342]
25 62184397 [89185875, 84892080, 53223034]
26 85704767 [85509773, 81710287, 78387716]
27 85585603 [66254198, 87569015, 52455599]
28 82964119 [76360309, 76069982]
29 53776152 [92585971, 74907523]
...
6204 rows × 2 columns
Student_ID is a unique id for each student, and Brothers is a list with all the ids that are siblings of that student.
To store the data for the matching, I created a Student class that holds all the attributes I need. Here is a link to download the entire dataset.
class Student:
    def __init__(self, index, id, vbrothers=None):
        self.__index = index
        self.__id = id
        # Avoid a mutable default argument: a shared [] would be
        # reused across all instances created without siblings.
        self.__vbrothers = vbrothers if vbrothers is not None else []

    @property
    def index(self):
        return self.__index

    @property
    def id(self):
        return self.__id

    @property
    def vbrothers(self):
        return self.__vbrothers
I'm instantiating my Student objects by looping over all the rows of my dataframe and then appending each one to a list:
students = []
for index, row in students_data.iterrows():
    student = Student(index, row['Student_ID'], row['Brothers'])
    students.append(student)
Now, my problem is that I need, for each student, the indices of his siblings in the students list. Currently I'm using this nested loop:
for student in students:
    student.vbrothers_index = [brother.index for brother in students
                               if student.id in brother.vbrothers]
This is by far the section with the worst performance of my entire code. It's 4 times slower than the second-worst section.
Any suggestion on how to improve the performance of this nested loop is welcome.

Since order in students does not matter, make it a dictionary:
students = {}
for index, row in students_data.iterrows():
    student = Student(index, row['Student_ID'], row['Brothers'])
    students[row['Student_ID']] = student
Now you can retrieve each student by ID in constant time. Two details to watch: iterating a dict yields its keys, so loop over students.values(), and vbrothers holds raw IDs rather than Student objects:

for student in students.values():
    student.vbrothers_index = [students[brother_id].index for brother_id in student.vbrothers]
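Note that this resolves the IDs listed in each student's own vbrothers, whereas the original nested loop found the students who list student.id in their vbrothers. If sibling lists are guaranteed symmetric, the two agree; if not, here is a sketch that preserves the original semantics with a reverse index, still in linear time:

from collections import defaultdict

# Map each listed sibling ID -> indices of the students who list it,
# built in one pass over all sibling entries.
listed_by = defaultdict(list)
for s in students.values():
    for brother_id in s.vbrothers:
        listed_by[brother_id].append(s.index)

for s in students.values():
    s.vbrothers_index = listed_by.get(s.id, [])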

Related

Removing for loop - using dictionary instead of pandas

I have 2 lists:
customer_ids
recommendations (a list of lists, each inner list holding 6000 shop_ids)
Each list in recommendations holds the shops recommended for the corresponding customer in customer_ids.
I have to keep only 20 shop_ids per customer, restricted to shops in that customer's city.
Desired output:
recommendations (a list of lists, each inner list holding 20 shop_ids)
customer_ids = ['1','2','3',...]
recommendations = [['110','589','865'...], ['422','378','224'...],['198','974','546'...]]
Filter: shop's city == customer's city.
To extract the cities for customers and shops I have 2 SQL queries:
df_cust_city = pd.read_sql_query("SELECT id, city_id FROM customer_table")
df_shop_city = pd.read_sql_query("SELECT shop_id, city FROM shop_table")
Code using lists
from itertools import islice

filtered_list = []
for cust_id, shop_id in zip(customer_ids, recommendations):
    cust_city = df_cust_city.loc[df_cust_city['id'] == cust_id, 'city_id'].iloc[0]  # get customer city
    df_city_filter = df_shop_city.where(df_shop_city['city'] == cust_city).dropna()  # all shops in customer city
    df_city_filter = df_city_filter.astype(int)
    filter_shop = df_city_filter['shop_id'].astype(str).values.tolist()  # list of shop_ids in customer city
    filtered = [x for x in shop_id if x in filter_shop]  # keep recommended shop_ids that are in the city
    shop_filtered = list(islice(filtered, 20))
    filtered_list.append(shop_filtered)  # recommendation list of lists with only 20 filtered shop_ids
Code using pandas
filtered_list = []
for cust_id, shop_id in zip(customer_ids, recommendations):
    cust_city = df_cust_city.loc[df_cust_city['id'] == cust_id, 'city_id'].iloc[0]  # get customer city
    df_city_filter = df_shop_city.where(df_shop_city['city'] == cust_city).dropna()
    recommended_shop = pd.DataFrame(shop_id, columns=['id'])
    recommended_shop['id'] = recommended_shop['id'].astype(int)
    shop_city_filter = df_city_filter[['shop_id']].astype(int).rename(columns={'shop_id': 'id'})
    shops_common = recommended_shop.merge(shop_city_filter, how='inner', on='id')
    shops_common.drop_duplicates(subset='id', keep=False, inplace=True)
    filtered = shops_common.head(20)
    shop_filtered = filtered['id'].values.tolist()
    filtered_list.append(shop_filtered)
Time taken for the complete for loop to run:
using lists: ~8000 seconds
using pandas: ~3000 seconds
I have to run the for loop 22 times.
Is there a way to completely get rid of the for loop? Any tips/pointers on how to make this take less time for 50,000 customers at once? I am trying it out with a dictionary.
df_cust_city:
id city_id
00919245 1
02220205 2
02221669 2
02223750 2
02304202 2
df_shop_city:
shop_id city
28 1
29 1
30 1
31 1
32 1
This will not get rid of the for loop, but how about you group customers by city first?
That way, the operations leading to filter_shop only have to be performed N_cities times rather than N_customers times. In addition, the computation of the filtered variable can be made significantly faster by testing membership against a set instead of a list, as sketched below.
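A sketch of that idea, assuming the column names shown above and that the id values in df_cust_city match the types in customer_ids (cast as needed):

from itertools import islice

# Build O(1) lookup structures once, instead of scanning the
# dataframes for every customer.
cust_to_city = dict(zip(df_cust_city['id'], df_cust_city['city_id']))

# city -> set of shop_ids (as strings, to match the recommendation lists)
shops_by_city = (
    df_shop_city.astype({'shop_id': str})
    .groupby('city')['shop_id']
    .apply(set)
    .to_dict()
)

filtered_list = []
for cust_id, shop_id in zip(customer_ids, recommendations):
    city_shops = shops_by_city.get(cust_to_city.get(cust_id), set())
    # set membership makes each check O(1) instead of a list scan
    filtered_list.append(list(islice((s for s in shop_id if s in city_shops), 20)))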

Prevent duplicate objects in python

I have a class definition like...
from collections import defaultdict

class companyInCountry:
    def __init__(self, name, country):
        self.name = name
        self.country = country
        self.amountOwed = defaultdict(int)
And I'm looping through a table that let's say has 6 rows...
COMPANY COUNTRY GROSS NET
companyA UK 50 40
companyA DE 20 15
companyA UK 10 5
companyA FR 20 10
companyB DE 35 25
companyB DE 10 5
What I want at the end of looping through this table is to end up with many company/territory specific objects, e.g.
object1.name = companyA
object1.territory = UK
object1.amountOwed['GROSS'] = 60
object1.amountOwed['NET'] = 45
But what I'm struggling to visualise is the best way to prevent the creation of objects with duplicate company/country combinations (which would happen for the first time at row 3 in my data). Is there some data type or declaration I can include inside my __init__ method that will ignore duplicates? Or do I need to manually check for the existence of a similar object before calling companyInCountry(name, country) to initialise a new instance?
The simplest way to do this would be to maintain a set of (company, country) tuples which can be consulted before creating a new object. If the pair already exists, skip it, otherwise create the object and add the new pair to the set. Something like
pairs = set()
for row in table:
    if (row.company, row.country) in pairs:
        continue
    pairs.add((row.company, row.country))
    company = CompanyInCountry(row.company, row.country)
    # do something with company
If you want a more object-oriented solution, delegate creation of companies to a collection class that performs the necessary checks before creation.
class CompanyCollection:
    def __init__(self):
        # A list to hold the companies - could also be a dict.
        self._companies = []
        self._keys = set()

    def add_company(self, row):
        key = (row.company, row.country)
        if key in self._keys:
            return
        self._keys.add(key)  # remember the pair so later duplicates are skipped
        self._companies.append(CompanyInCountry(*key))

    # Define methods for accessing the companies,
    # or whatever you want
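Since the end goal in the question is to accumulate GROSS and NET per company/country pair, note that a dict keyed by the pair handles both the duplicate check and the aggregation in one pass. A sketch, assuming each row exposes company, country, gross and net fields:

companies = {}
for row in table:
    key = (row.company, row.country)
    if key not in companies:
        companies[key] = companyInCountry(*key)
    # accumulate into the defaultdict(int) declared in __init__
    companies[key].amountOwed['GROSS'] += row.gross
    companies[key].amountOwed['NET'] += row.net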

Function to get row Value of a dataframe using str Index

Sample df:
Student Marks
Avery 70
Joe 80
John 75
Jordan 90
I want to use a function like the one below to return the marks when a student's name is passed.
def get_marks(student):
    return *something*
Expected Output: get_marks('Joe') ==> 80
I think the following might work.
def get_marks(student):
    p = df.index[df['Student'] == student].tolist()
    p = p[0]
    return df['Marks'][p]
What I have done is first get the index of the row for that student, and then return the Marks value at that same index.
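A more direct version of the same lookup, as a sketch under the same assumptions (the name exists in the column and is unique):

def get_marks(student):
    # the boolean mask selects the matching row; .iloc[0] takes its
    # Marks value (raises IndexError if the student is not found)
    return df.loc[df['Student'] == student, 'Marks'].iloc[0]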

Pandas function that prints the value of a column from another data frame

A newbie here! I have a df that looks like this:
key1 parentID fullname ssn birthdate
0 1 19 Verlie Bailey 496-35-2171 Fri-2011-06-10-17:28:19
1 2 10 Bernarda Tippett 532-36-2171 Sun-2016-05-29-11:47:28
2 3 27 Cecelia Hartnett 532-24-8961 Wed-2010-06-02-00:34:02
3 4 4 Kristin Hobbs 661-99-7959 Thu-2011-01-13-01:47:54
4 5 16 Enriqueta Jolley 661-35-9909 Wed-2010-09-29-08:44:12
5 6 40 Teresa Devine 125-97-2946 Sun-2015-12-27-16:39:14
6 7 15 Graham Deloach 661-36-1624 Sat-2012-07-21-12:04:41
7 8 48 Randolph Lasalle 893-36-8961 Sat-2012-12-01-15:23:08
8 9 4 Catharine Hobbs 323-36-8852 Sun-2014-03-09-09:02:52
9 10 37 Elnora Shippee 125-35-2998 Sat-2012-03-31-23:25:16
10 11 26 Latoya Purvis 532-97-9974 Mon-2012-07-09-17:01:17
And I need to create a function that prints the first name of the parent when I give it someone's fullname. I expect f('Catharine Hobbs') to print Kristin.
I have tried these, but none of them has worked:
parentId = 0
for line in family:
    if line[2] == fullname:
        parentId = line[1]
for line in employee:
    if line[1] == parentId:
        return line[2].split(' ')[0]

def f(x):
    parent = 0
    for i in family.fullname:
        if i == x:
            parent = family.parentID
    return parent
I know it's poor coding, but I don't understand why it doesn't work, and I haven't found anything like what I need online.
You can try doing a self-join:
pd.merge(df, df[['key1', 'fullname']], left_on='parentID', right_on='key1', how='left')
It should give you a new column with each person's parent's name, along with a bunch of extra columns, which you can filter out as per your requirements.
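To go from that merged frame to the parent's first name, a sketch (the suffixes argument is used here to disambiguate the duplicated key1/fullname columns):

merged = pd.merge(
    df, df[['key1', 'fullname']],
    left_on='parentID', right_on='key1',
    how='left', suffixes=('', '_parent'),
)
# the parent's first name is the first token of the parent's fullname
merged['parent_first_name'] = merged['fullname_parent'].str.split(' ').str[0]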
Your functions are pretty close to how I would approach this requirement. Approach: given a fullname, look up that person's parentID, find the row whose key1 matches it, and return the first token of the parent's fullname, split on a space.

def get_parent_first_name(fullname):
    person = df.loc[df['fullname'] == fullname, 'parentID']
    if person.empty:
        return None  # fullname not in the frame
    parent = df.loc[df['key1'] == person.iloc[0], 'fullname']
    if parent.empty:
        return None  # parentID not present as a key1
    return parent.iloc[0].split(' ')[0]

get_parent_first_name('Catharine Hobbs')
# 'Kristin'
get_parent_first_name('asdf')
# None

Get top DUPLICATE results in a table (with only 1 query)

Suppose I have a model:
class People(models.Model):
    name = models.CharField(max_length=200)
    age = models.PositiveIntegerField()
And this is the current data for it:
name age
Bob 18
Carly 20
Steve 20
John 20
Ted 19
Judy 17
How do I get the rows tied for the top value? That is:
name age
Carly 20
Steve 20
John 20
I can't work out how to do it in a single Django query. I can do it by sorting by age and then filtering for exact matches on that top age, but that takes two queries.
Using QuerySet.extra:
people = People.objects.extra(where=['age=(select max(age) from app_people)'])
or, without hardcoding the table name:
people = People.objects.extra(where=['age=(select max(age) from {})'.format(
    People._meta.db_table
)])
People.objects.filter(age=People.objects.order_by('-age')[0].age)
but that is really two queries combined into one expression: one to fetch the row with the max age, and one to filter on it.
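On Django 1.11+, the same result can be expressed as a single query without raw SQL by using Subquery; a minimal sketch:

from django.db.models import Subquery

# The subquery selects the single highest age; the outer query
# filters for everyone matching it, all in one SQL statement.
people = People.objects.filter(
    age=Subquery(People.objects.order_by('-age').values('age')[:1])
)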
