Suppose I have a model:
class People(models.Model):
    name = models.CharField(max_length=200)
    age = models.PositiveIntegerField()
And this is the current data for it:
name age
Bob 18
Carly 20
Steve 20
John 20
Ted 19
Judy 17
How do I get the top duplicates? That is:
name age
Carly 20
Steve 20
John 20
I can't figure out how to do it in a single Django query. I can do it by sorting by age and then filtering for the rows that match that top age, but that takes two queries.
Using QuerySet.extra:
people = People.objects.extra(where=['age=(select max(age) from app_people)'])
or
people = People.objects.extra(where=['age=(select max(age) from {})'.format(
    People._meta.db_table
)])
You could also write:

People.objects.filter(age=People.objects.order_by('-age')[0].age)

but that's really two queries clubbed into one.
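On Django 1.11+ you can keep it to a true single query with Subquery; a minimal sketch against the model above:

from django.db.models import Subquery

People.objects.filter(age=Subquery(
    People.objects.order_by('-age').values('age')[:1]
))

Here the max-age lookup runs as a subquery inside the one SELECT instead of being evaluated separately.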
I have a csv with a few people. I would like to build a function that will either filter based on all parameters given or return the entire dataframe as it is if no arguments are passed.
So given this as the csv:
FirstName LastName City
Matt Fred Austin
Jim Jack NYC
Larry Bob Houston
Matt Spencer NYC
If I were to call my function find, here is what I would expect to see depending on what I pass as arguments:
find(first="Matt", last="Fred")
Output: Matt Fred Austin
find()
Output: Full Dataframe
find(last="Spencer")
Output: Matt Spencer NYC
find(address="NYC")
Output: All people living in NYC in dataframe
This is what I have tried:
def find(first=None, last=None, city=None):
    file = pd.read_csv(csv_path)  # path to the csv above
    searched = file.loc[(file["FirstName"] == first) & (file["LastName"] == last) & (file["City"] == city)]
    return searched

This returns an empty frame if I just pass in a first name and nothing else, since the other comparisons against None never match anything.
You could do something like this:
import numpy as np

def find(**kwargs):
    assert np.isin(list(kwargs.keys()), df.columns).all()
    return df.loc[df[list(kwargs.keys())].eq(list(kwargs.values())).all(axis=1)]
search = find(FirstName="Matt", LastName="Fred")
print(search)
# FirstName LastName City
#0 Matt Fred Austin
find(LastName="Spencer")
# FirstName LastName City
#3 Matt Spencer NYC
If you want to use "first", "last" and "city" as the keys:
def find(**kwargs):
    df_index = df.rename(columns={"FirstName": "first",
                                  "LastName": "last",
                                  "City": "city"})
    assert np.isin(list(kwargs.keys()), df_index.columns).all()
    return df.loc[df_index[list(kwargs.keys())]
                  .eq(list(kwargs.values())).all(axis=1)]
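With the renamed keys, the call from the question should now work as hoped (a quick check, assuming the same df as above):

print(find(first="Matt", last="Fred"))
#  FirstName LastName    City
#0      Matt     Fred  Austin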
Another alternative approach to filtering columns:

import os
import pandas as pd

csv_path = os.path.abspath('test.csv')
df = pd.read_table(csv_path, sep=r'\s+')

def find_by_attrs(df, **attrs):
    if attrs.keys() - df.columns:
        raise KeyError('Improper column name(s)')
    return df[df[list(attrs)].eq(list(attrs.values())).all(axis=1)]

print(find_by_attrs(df, City="NYC"))
The output:
FirstName LastName City
1 Jim Jack NYC
3 Matt Spencer NYC
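A quick sanity check with a single filter; rows 0 and 3 both have FirstName "Matt":

print(find_by_attrs(df, FirstName="Matt"))
#  FirstName LastName    City
#0      Matt     Fred  Austin
#3      Matt  Spencer     NYC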
I have two data frames that I'm trying to merge, based on a primary & foreign key of company name. One data set has ~50,000 unique company names, the other has about 5,000. Duplicate company names are possible within each list.
To that end, I've tried to follow the first solution from Figure out if a business name is very similar to another one - Python. Here's an MWE:
mwe1 = pd.DataFrame({'company_name': ['Deloitte',
                                      'PriceWaterhouseCoopers',
                                      'KPMG',
                                      'Ernst & Young',
                                      'intentionall typo company XYZ'],
                     'revenue': [100, 200, 300, 250, 400]})

mwe2 = pd.DataFrame({'salesforce_name': ['Deloite',
                                         'PriceWaterhouseCooper'],
                     'CEO': ['John', 'Jane']})
I am trying to get the following code from Figure out if a business name is very similar to another one - Python to work:
# token2frequency is just a word counter of all words in all names
# in the dataset
def sequence_uniqueness(seq, token2frequency):
    return sum(1 / token2frequency[t] ** 0.5 for t in seq)

def name_similarity(a, b, token2frequency):
    a_tokens = set(a.split())
    b_tokens = set(b.split())
    a_uniq = sequence_uniqueness(a_tokens, token2frequency)
    b_uniq = sequence_uniqueness(b_tokens, token2frequency)
    return sequence_uniqueness(a_tokens.intersection(b_tokens), token2frequency) / (a_uniq * b_uniq) ** 0.5
How do I apply those two functions to produce a similarity score between each possible combination of mwe1 and mwe2, then filter down to the most probable matches?
For example, I'm looking for something like this (I'm just making up the scores in the similarity_score column):
company_name revenue salesforce_name CEO similarity_score
Deloitte 100 Deloite John 98
PriceWaterhouseCoopers 200 Deloite John 0
KPMG 300 Deloite John 15
Ernst & Young 250 Deloite John 10
intentionall typo company XYZ 400 Deloite John 2
Deloitte 100 PriceWaterhouseCooper Jane 20
PriceWaterhouseCoopers 200 PriceWaterhouseCooper Jane 97
KPMG 300 PriceWaterhouseCooper Jane 5
Ernst & Young 250 PriceWaterhouseCooper Jane 7
intentionall typo company XYZ 400 PriceWaterhouseCooper Jane 3
I'm also open to better end-states, if you can think of one. Then, I'd filter that table above to get something like:
company_name revenue salesforce_name CEO similarity_score
Deloitte 100 Deloite John 98
PriceWaterhouseCoopers 200 PriceWaterhouseCooper Jane 97
Here's what I've tried:
name_similarity(a = mwe1['company_name'], b = mwe2['salesforce_name'], token2frequency = 10)
AttributeError: 'Series' object has no attribute 'split'
I'm familiar with using lambda functions but not sure how to make it work when iterating through two columns in two Pandas data frames.
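For reference, a minimal sketch of how the two (fixed) functions above could be applied pairwise, assuming token2frequency is a collections.Counter over all words in both name columns, and pandas >= 1.2 for how='cross':

from collections import Counter
import pandas as pd

# word counter over every name in both frames (the assumed meaning
# of "token2frequency" in the quoted code)
all_names = pd.concat([mwe1['company_name'], mwe2['salesforce_name']])
token2frequency = Counter(t for name in all_names for t in name.split())

# score every company_name / salesforce_name pair via a cross join
pairs = mwe1.merge(mwe2, how='cross')
pairs['similarity_score'] = pairs.apply(
    lambda r: name_similarity(r['company_name'], r['salesforce_name'], token2frequency),
    axis=1)

# keep the best-scoring company_name for each salesforce_name
best = pairs.loc[pairs.groupby('salesforce_name')['similarity_score'].idxmax()]

Note that because this approach compares whole-word token sets, a one-letter typo like 'Deloite' still scores 0 against 'Deloitte'; the difflib-based answer below handles character-level typos better.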
Here is a class I wrote using difflib that should be close to what you need.
import difflib
from typing import Optional

import pandas as pd

class FuzzyMerge:
    """
    Works like pandas merge except merges on approximate matches.
    """
    def __init__(self, **kwargs):
        self.left = kwargs.get("left")
        self.right = kwargs.get("right")
        self.left_on = kwargs.get("left_on")
        self.right_on = kwargs.get("right_on")
        self.how = kwargs.get("how", "inner")
        self.cutoff = kwargs.get("cutoff", 0.8)

    def merge(self) -> pd.DataFrame:
        temp = self.right.copy()
        temp[self.left_on] = [
            self.get_closest_match(x, self.left[self.left_on]) for x in temp[self.right_on]
        ]
        df = self.left.merge(temp, on=self.left_on, how=self.how)
        df["similarity_percent"] = df.apply(
            lambda x: self.similarity_score(x[self.left_on], x[self.right_on]), axis=1
        )
        return df

    def get_closest_match(self, left: str, right: pd.Series) -> Optional[str]:
        matches = difflib.get_close_matches(left, right, cutoff=self.cutoff)
        return matches[0] if matches else None

    @staticmethod
    def similarity_score(left: str, right: str) -> int:
        return int(round(difflib.SequenceMatcher(a=left, b=right).ratio(), 2) * 100)
Call it with:
df = FuzzyMerge(left=df1, right=df2, left_on="column from df1", right_on="column from df2", how="inner", cutoff=0.8).merge()
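For the MWE above, that would be something like:

df = FuzzyMerge(left=mwe1, right=mwe2,
                left_on="company_name", right_on="salesforce_name").merge()

With the default cutoff of 0.8, 'Deloite' and 'PriceWaterhouseCooper' should map to their correctly spelled counterparts, and similarity_percent plays the role of your similarity_score.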
I'm working on a matching problem where I have to assign students to schools. The issue is that I have to consider siblings for each student, since that is a relevant feature when setting priorities for each school.
My data looks like the one below.
Index Student_ID Brothers
0 92713846 [79732346]
1 69095898 [83462239]
2 67668672 [75788479, 56655021, 75869616]
3 83396441 []
4 65657616 [52821691]
5 62399116 []
6 78570850 [62046889, 63029349]
7 69185379 [70285250, 78819847, 78330994]
8 66874272 []
9 78624173 [73902609, 99802441, 95706649]
10 97134369 []
11 77358607 [52492909, 59830215, 71251829]
12 56314554 [54345813, 71451741]
13 97724180 [64626337]
14 73480196 [84454182, 90435785]
15 70717221 [60965551, 98620966, 70969443]
16 60942420 [54370313, 63581164, 72976764]
17 81882157 [78787923]
18 73387623 [87909970, 57105395]
19 59115621 [62494654]
20 54650043 [69308874, 88206688]
21 53368352 [63191962, 53031183]
22 76024585 [61392497]
23 84337377 [58419239, 96762668]
24 50099636 [80373936, 54314342]
25 62184397 [89185875, 84892080, 53223034]
26 85704767 [85509773, 81710287, 78387716]
27 85585603 [66254198, 87569015, 52455599]
28 82964119 [76360309, 76069982]
29 53776152 [92585971, 74907523]
...
6204 rows × 2 columns
Student_ID is a unique id for each student, and Brothers is a list with all the ids that are siblings of that student.
To store my data for the matching, I created a Student class holding all the attributes I need. Here is a link to download the entire dataset.
class Student:
    def __init__(self, index, id, vbrothers=None):
        # use None instead of a mutable default argument
        self.__index = index
        self.__id = id
        self.__vbrothers = vbrothers if vbrothers is not None else []

    @property
    def index(self):
        return self.__index

    @property
    def id(self):
        return self.__id

    @property
    def vbrothers(self):
        return self.__vbrothers
I'm instantiating my Student class objects by looping over all the rows of my dataframe, and then appending each one to a list:
students = []
for index, row in students_data.iterrows():
    student = Student(index, row['Student_ID'], row['Brothers'])
    students.append(student)
Now, my problem is that I need a pointer to the index of each sibling in the students list. Currently, I'm doing it with this nested loop:
for student in students:
    student.vbrothers_index = [brother.index for brother in students if (student.id in brother.vbrothers)]
This is by far the section with the worst performance of my entire code. It's 4 times slower than the second-worst section.
Any suggestion on how to improve the performance of this nested loop is welcome.
Since order in students does not matter, make it a dictionary:
students = {}
for index, row in students_data.iterrows():
    student = Student(index, row['Student_ID'], row['Brothers'])
    students[row['Student_ID']] = student
Now, you can retrieve each student by his ID in constant time:
for student in students.values():
    student.vbrothers_index = [students[brother_id].index for brother_id in student.vbrothers]
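With ~6,200 students, this replaces the quadratic all-pairs scan with one constant-time dictionary lookup per sibling, which should remove this section as the bottleneck.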
So I have all the WECO rules I wrote, and they work on a single column of data.
But now I want to group by 'name' and then score the 'score' column.
My problem is using groupby and outputting the result to a new df2.
- This will be for many datasets with 5 to 40+ names.
- This is one of the rules:
WECO_A = ["N"]
UCL = .2
lastPoint=df.groupby('name').iloc[0]['score']
if lastPoint > UCL:
WECO_A = "Y"
if WECO_A == "Y":
df2['weco'] = df.groupby('name') + 'RULE_A'
else:
df2['weco'] = df.groupby('name') + 'OK'
df:
name score
bob 0.2849
sue 0.1960
ken 0.8427
bob 0.2844
sue 0.2507
ken 0.9904
...etc
and I am looking for this
df2:
name weco
bob RULE_A
sue OK
ken RULE_A
Or even as a single column:
df2:
weco
bob RULE_A
sue OK
ken RULE_A
- Just an example; I'm not sure what the real scores would be.
And thanks in advance, as always.
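One way to get there, as a minimal sketch: assuming, per the lastPoint line above, that rule A flags a name whenever its first score exceeds UCL (which matches the example output),

UCL = 0.2

df2 = (
    df.groupby('name', sort=False)['score']
      .first()                                  # first point per name
      .gt(UCL)                                  # rule A: above the UCL?
      .map({True: 'RULE_A', False: 'OK'})
      .rename('weco')
      .reset_index()
)

This yields one row per name with either 'RULE_A' or 'OK' in the weco column; drop .reset_index() if you want the single-column variant indexed by name.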
I have a dataframe df with two columns called 'MovieName' and 'Actors'. It looks like:
MovieName Actors
lights out Maria Bello
legend Tom Hardy*Emily Browning*Christopher Eccleston*David Thewlis
Please note that different actor names are separated by '*'. I have another csv file called gender.csv which has the gender of all actors based on their first names. gender.csv looks like -
ActorName Gender
Tom male
Emily female
Christopher male
I want to add two columns in my dataframe 'female_actors' and 'male_actors' which contains the count of female and male actors in that particular movie respectively.
How do I achieve this task using both df and gender.csv in pandas?
Please note that:
If a particular name isn't present in gender.csv, don't count it in the total.
If there is just one actor in a movie and they aren't present in gender.csv, their count should be zero.
The result of the above example should be:
MovieName Actors male_actors female_actors
lights out Maria Bello 0 0
legend Tom Hardy*Emily Browning*Christopher Eccleston*David Thewlis 2 1
import pandas as pd

df1 = pd.DataFrame({'MovieName': ['lights out', 'legend'],
                    'Actors': ['Maria Bello', 'Tom Hardy*Emily Browning*Christopher Eccleston*David Thewlis']})
df2 = pd.DataFrame({'ActorName': ['Tom', 'Emily', 'Christopher'],
                    'Gender': ['male', 'female', 'male']})

def func(actors, gender):
    # first name of every actor in the movie
    first_names = [act.split()[0] for act in actors.split('*')]
    # count the gender.csv rows with this gender whose name appears in the movie
    return ((df2.Gender == gender) & (df2.ActorName.isin(first_names))).sum()

df1['male_actors'] = df1.Actors.apply(lambda x: func(x, 'male'))
df1['female_actors'] = df1.Actors.apply(lambda x: func(x, 'female'))
df1.to_csv('res.csv', index=False)
print(df1)
Output (contents of res.csv):
Actors,MovieName,male_actors,female_actors
Maria Bello,lights out,0,0
Tom Hardy*Emily Browning*Christopher Eccleston*David Thewlis,legend,2,1
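A vectorized alternative, as a sketch assuming the same df1/df2 and pandas >= 0.25 for explode: split the actor strings, map first names to genders, then count per movie.

# one row per (movie, actor), keeping only the first name
exploded = df1.assign(first_name=df1.Actors.str.split('*')).explode('first_name')
exploded['first_name'] = exploded['first_name'].str.split().str[0]
exploded['Gender'] = exploded['first_name'].map(df2.set_index('ActorName')['Gender'])

# count genders per movie; reindex restores movies with no known actors as 0
counts = (pd.crosstab(exploded['MovieName'], exploded['Gender'])
            .reindex(index=df1['MovieName'], columns=['male', 'female'], fill_value=0))
df1[['male_actors', 'female_actors']] = counts.values

Names missing from gender.csv map to NaN and are dropped by crosstab, so they are simply not counted, as the question requires.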