Prevent duplicate objects in python

Prevent duplicate objects in python - python

I have class definition like...
class companyInCountry:
def __init__(self,name,country):
self.name = name
self.country = country
self.amountOwed = defaultdict(int)
And I'm looping through a table that let's say has 6 rows...
COMPANY COUNTRY GROSS NET
companyA UK 50 40
companyA DE 20 15
companyA UK 10 5
companyA FR 20 10
companyB DE 35 25
companyB DE 10 5
What I want at the end of looping through this table is to end up with many company/territory specific objects, e.g.
object1.name = companyA
object1.territory = UK
object1.amountOwed['GROSS'] = 60
object1.amountOwed['NET'] = 45
But what I'm struggling to visualise is the best way to prevent objects being created that have duplicate company/country combinations (e.g. that would happen for the first time on row 3 in my data). Is there some data type or declaration I can include inside my init def that will ignore duplicates? Or do I need to manually check for the existence of similar objects before calling companyInCountry(name,country) to initialise a new instance?

The simplest way to do this would be to maintain a set of (company, country) tuples which can be consulted before creating a new object. If the pair already exists, skip it, otherwise create the object and add the new pair to the set. Something like
pairs = set()
for row in table:
if (row.company, row.country) in pairs:
continue
pairs.add((row.company, row.country))
company = CompanyInCountry(row.company, row.country)
# do something with company
If you want a more object-oriented solution, delegate creation of companies to a collection class that performs the necessary checks before creation.
class CompanyCollection:
def __init__(self):
# A list to hold the companies - could also be a dict.
self._companies = []
self._keys = set()
def add_company(self, row):
key = (row.company, row.country)
if key in self._keys:
return
self._companies.append(CompanyInCountry(*key))
return
# Define methods for accessing the companies,
# or whatever you want

Related

Concatenate Instance attribute using class with input

I am super new to python so new to OOP and class (I am originally MATLAB user as an engineer...) so please teach me as much as possible.
Anyways I am trying to do the following.
Create a class called Stock - something like below
class Stock :
def __init__(self,estimate,earning)
self.estimate = estimate # estimation of quarterly earnings
self.earning = earning # actual quarterly earnings
JPM(JP Morgan stock name) = Stock(11.7,10.9)
However, the estimate and earning values are reported every quarter and I want to create a numerical vector for each. The idea is like below, but of course it does not work.
JPM.estimate(1) = 11.9 # the second quarter earnings value at index 1 of the estimate
JPM.estimate(2) = 12.1 # the third quarter earnings value at index 2 of the estimate
JPM.estimate(3) = XX.XX # and so on.
Using .estimate(#) is just to show what I want to do. Using .append() or other methods you would like to teach me is fine.
The reason I am trying to do it this way is because I need 3 vectors for one stock(and I have about 1000 stocks so at the end I would have 3000 vectors to take care of). So I am planning on creating an instance of a stock and having 3 vectors as instance attributes. (Hopefully I got the terminology right.)
earnings vector
estimate vector
the date those earnings were reported.
Am I using the class function wrong(as it was never intended to be used this way?) or what can I do to achieve such concatenation for instance attributes as the data are received from web scraping?

It is not at all clear what you are trying to do with the Stock Class, but if all you want to do is create a list of stock price and earnings organized by date, you could do the following :
from collections import namedtuple, defaultdict
# Create a easily referenced tuple for defining staock data
StockData = namedtuple('StockData', ['date', 'earn', 'est'])
class Stock:
def __init__(self, data: StockData) -> None:
self._quotes = defaultdict()
self._quotes[data.date] = (data.earn, data.est)
def add(self, data: StockData) -> None:
self._quotes[data.date] = (data.earn, data.est)
def value(self, date: str) -> tuple:
# return tuple of (Earnings, Estimate) for date if it exists, else KeyError
return self._quotes[date]
def __repr__(self):
return str(self._quotes)
To load the stock class with data, you can do something along the lines of:
stk = Stock(StockData('1/20/2021', 123.5, 124.0))
stk.add(StockData('6/23/2021', 132.7, 119.4))
print(stk) yields:
defaultdict(None, {'1/20/2021': (123.5, 124.0), '6/23/2021': (132.7, 119.4)})
and, stk.value('1/20/2021') yields (123.5, 124.0)

Identifying elements in a dataframe

I have a dictionary of dataframes called names_and_places in pandas that looks like the below.
names_and_places:
Alfred,,,
Date,F_1,F_2,Key
4/1/2020,1,4,NAN
4/2/2020,2,5,NAN
4/3/2020,3,6,"[USA,NY,NY, NY]"
Brett,,,
Date,F_1,F_2,Key
4/1/2020,202,404,NAN
4/2/2020,101,401,NAN
4/3/2020,102,403,"[USA,CT, Fairfield, Stamford] "
Claire,,,
Date,F_1,F_2,Key
4/1/2020,NAN,12,NAN
4/2/2020,NAN,45,NAN
4/3/2020,7,78,"[USA,CT, Fairfield, Darian] "
Dane,,,
Date,F_1,F_2,Key
4/1/2020,4,17,NAN
4/2/2020,5,18,NAN
4/3/2020,7,19,"[USA,CT, Bridgeport, New Haven] "
Edward,,,
Date,F_1,F_2,Key
4/1/2020,4,17,NAN
4/2/2020,5,18,NAN
4/3/2020,7,19,"[USA,CT, Bridgeport, Milford] "
(text above or image below)
The key column is either going to be NAN or of the form [Country, State, County, City], but can be of length 3 or 4 elements (sometimes County is absent). I need to find all the elements with a given element that is contained in a key. For instance if the element = "CT", the script returns Edward, Brett, Dane and Claire (order is not important). If the element = "Stamford" then only Brett is returned. However I am going about the identification process in a way that seems very inefficient. I basically have variables that iterate through each possible combination of State, County, City (all of which I am currently manually inputting into variables) to identify which names to extract like below:
country = 'USA' #this never needs to change
element = 'CT'
#These next two are actually in .txt files that I create once I am asked for
#a given breakdown but I would like to not have to manually input these
middle_node = ['Fairfield','Bridgeport']
terminal_nodes = ['Stamford','Darian','New Haven','Milford']
names=[]
for a in middle_node:
for b in terminal_nodes:
my_key = [country,key_of_interest,a,b]
for s in names_and_places:
for z in names_and_places[s]['Key']:
if my_key == z:
names.append(s)
#Note having "if my_key in names_and_places[s]['Key']": was causing sporadic failures for
#some reason
display(names)
Output:
Edward, Brett, Dane, Claire
What I would like is to be able to input only the variable element and this can either be a level 2 (State), 3 (County), or 4 (City) node. However short of adding additional for loops and going into the Key column, I don't know how to do this. The one benefit (for a novice like myself) is that the double for loops allow me to keep bucketing intact and makes it easier for people to see where names are coming from when that is also needed.
But is there a better way? For bonus points if there is a way to handle the case when the key_of_interest is 'NY' and values in the Keys column can be like [USA, NY, NY, NY] or [USA, NY, NY, Queens].
Edit: names_and_places is a dictionary with names as the index, so
display(names_and_places['Alfred'])
would be
Date,F_1,F_2,Key
4/1/2020,1,4,NAN
4/2/2020,2,5,NAN
4/3/2020,3,6,"[USA,NY,NY, NY]"
I do have the raw dataframe that has columns:
Date, Field name, Value, Names,
Where Field Name is either F_1, F_2 or Key and Value is the associated value of that field. I then pivot the data on Name with columns of Field Name to make my extraction easier.

Here's a way to do that in a somewhat more effective way. You start by building a single dataframe out of the dictionary, and then do the actual work on that dataframe.
single_df = pd.concat([df.assign(name = k) for k, df in names_and_places.items()])
single_df["Key"] = single_df.Key.replace("NAN", np.NaN)
single_df.dropna(inplace=True)
# Since the location is a string, we have to parse it.
location_df = pd.DataFrame(single_df.Key.str.replace(r"[\[\]]", "").str.split(",", expand=True))
location_df.columns = ["Country", "State", "County", "City"]
single_df = pd.concat([single_df, location_df], axis=1)
# this is where the actual query goes.
single_df[(single_df.Country == "USA") & (single_df.State == "CT")].name
The output is:
2 Brett
2 Claire
2 Dane
2 Edward
Name: name, dtype: object

How can I improve the performance of this nested loop?

I'm working on a matching problem where I have to assign students to schools. The issue is that I have to consider siblings for each student since is a relevant feature at the time of setting priorities for each school.
My data looks like the one below.
Index Student_ID Brothers
0 92713846 [79732346]
1 69095898 [83462239]
2 67668672 [75788479, 56655021, 75869616]
3 83396441 []
4 65657616 [52821691]
5 62399116 []
6 78570850 [62046889, 63029349]
7 69185379 [70285250, 78819847, 78330994]
8 66874272 []
9 78624173 [73902609, 99802441, 95706649]
10 97134369 []
11 77358607 [52492909, 59830215, 71251829]
12 56314554 [54345813, 71451741]
13 97724180 [64626337]
14 73480196 [84454182, 90435785]
15 70717221 [60965551, 98620966, 70969443]
16 60942420 [54370313, 63581164, 72976764]
17 81882157 [78787923]
18 73387623 [87909970, 57105395]
19 59115621 [62494654]
20 54650043 [69308874, 88206688]
21 53368352 [63191962, 53031183]
22 76024585 [61392497]
23 84337377 [58419239, 96762668]
24 50099636 [80373936, 54314342]
25 62184397 [89185875, 84892080, 53223034]
26 85704767 [85509773, 81710287, 78387716]
27 85585603 [66254198, 87569015, 52455599]
28 82964119 [76360309, 76069982]
29 53776152 [92585971, 74907523]
...
6204 rows × 2 columns
Student_ID is a unique id for each student, and Brothers is a list with all the ids that are siblings of that student.
In order to save my data for the matching, I create a Student class, where I save all the attributes that I need for the matching. Here is a link to download the entire dataset.
class Student():
def __init__(self, index, id, vbrothers = []):
self.__index = index
self.__id = id
self.__vbrothers = vbrothers
#property
def index(self):
return self.__index
#property
def id(self):
return self.__id
#property
def vbrothers(self):
return self.__vbrothers
I'm instantiating my Student class object doing a loop on all the rows of my dataframe, an then appending each one in a list:
students = []
for index, row in students_data.iterrows():
student = Student(index, row['Student_ID'], row['Brothers'])
students.append(student)
Now, my problem is that I need a pointer to the index of each sibling in the students list. Actually, I'm implementing this nested loop:
for student in students:
student.vbrothers_index = [brother.index for brother in students if (student.id in brother.vbrothers)]
This is by far the section with the worst performance of my entire code. It's 4 times slower than the second-worst section.
Any suggestion on how to improve the performance of this nested loop is welcome.

Since order in students does not matter, make it a dictionary:
students = {}
for index, row in students_data.iterrows():
student = Student(index, row['Student_ID'], row['Brothers'])
students[row['Student_ID']] = student
Now, you can retrieve each student by his ID in constant time:
for student in students:
student.vbrothers_index = [students[brother.id].index for brother in student.vbrothers]

Looping with list comprehensions

To simplify what I am trying to do:
I have 50 employees, each with a 40 task capacity.
I have a dataframe that I am reading in from a SQL table that I want to filter down to tasks with a score equal to 10 and then assign them to each employee so they have a full "basket" or workload. I want to assign one task to each employee and then iterate until finished.
My final output would look like a list with each list based off of position, denoting the employee number and the tasks that are assigned to them.
final_basket = [[task1, task2,...] , [task8, task11], ...[task45,task4]]
each one of the lists within the final basket would correspond to an employee, example:
final_basket[0] = [task1, task2,...] would be all the tasks for the first employee.
I can assign a task to each employe fine, but I get stuck with re-looping over all the employees to fill their capacity.
def basket_builder(i):
agent_basket = [[ ] for basket in range(40)] #define empty basket for all 40 agents
score_10 = base_data_1_mo[base_data_1_mo.case_score == 10] #filter data to score 10 only
score_10 = score_10[['investigation_id']] #select only investigation id df
score_10 = score_10.sort_index() ##sort by index asc
for i in range(40):
investigation_id = score_10.iloc[0]['investigation_id']
agent_basket[i].append(investigation_id)
index_drop_v2 = score_10[score_10.investigation_id == investigation_id].index[0]
score_10 = score_10.drop([index_drop_v2])
return final_basket
for i in range(40):
final_basket = []
final_basket = [[basket_builder(i) for agent in agent_basket[i]]
final_basket
Since I made some modifications to use a function to try and loop over here, I am now having an issue even printing the final_Basket

Could you do it with something like:
employee_task = {}
for n, task in enumerate(tasks):
employee_number = n % 50
if employee_number not in employee_task:
employee_task[employee_number] = []
employee_task[employee_number].append(task)
and check a posteriori that each employee has less than 40 tasks

Iterating through form data

I have a QueryDict object in Django as follows:
{'ratingname': ['Beginner', 'Professional'], 'sportname': ['2', '3']
where the mapping is such:
2 Beginner
3 Professional
and 2, 3 are the primary key values of the sport table in models.py:
class Sport(models.Model):
name = models.CharField(unique=True, max_length=255)
class SomeTable(models.Model):
sport = models.ForeignKey(Sport)
rating = models.CharField(max_length=255, null=True)
My question here is, how do I iterate through ratingname such that I can save it as
st = SomeTable(sport=sportValue, rating=ratingValue)
st.save()
I have tried the following:
ratings = dict['ratingname']
sports = dict['sportname']
for s,i in enumerate(sports):
sport = Sport.objects.get(pk=sports[int(s[1])])
rate = SomeTable(sport=sport, rating=ratings[int(s)])
rate.save()
However, this creates a wrong entry in the tables. For example, with the above given values it creates the following object in my table:
id: 1
sport: 2
rating: 'g'
How do I solve this issue or is there a better way to do something?

There are a couple of problems here. The main one is that QueryDicts return only the last value when accessed with ['sportname'] or the like. To get the list of values, use getlist('sportname'), as documented here:
https://docs.djangoproject.com/en/1.7/ref/request-response/#django.http.QueryDict.getlist
Your enumerate is off, too - enumerate yields the index first, which your code assigns to s. So s[1] will throw an exception. There's a better way to iterate through two sequences in step, though - zip.
ratings = query_dict.getlist('ratingname') # don't reuse built in names like dict
sports = query_dict.getlist('sportname')
for rating, sport_pk in zip(ratings, sports):
sport = Sport.objects.get(pk=int(sport_pk))
rate = SomeTable(sport=sport, rating=rating)
rate.save()
You could also look into using a ModelForm based on your SomeTable model.

You may use zip:
ratings = dict['ratingname']
sports = dict['sportname']
for rating, sport_id in zip(ratings, sports):
sport = Sport.objects.get(pk=int(sport_id))
rate = SomeTable(sport=sport, rating=rating)
rate.save()

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Prevent duplicate objects in python - python

Related

Concatenate Instance attribute using class with input

Identifying elements in a dataframe

How can I improve the performance of this nested loop?

Looping with list comprehensions

Iterating through form data

Categories

Resources