in my list comprehension, I am trying to remove"RT #tiktoksaudi2:"occurrences if it exists in a list, yet I get an empty list even though it doesn't exist in the list
test=["The way to a man's heart is through his stomach. - Sarah Willis Parton", 'A first rate soup is better than a second rate painting. - Abraham Maslow', 'A good cook is like a sorceress who dispenses happiness. - Elsa Schiaparelli', "A man's palate can, in time, become accustomed to anything. - Napoleon Bonaparte", 'Savory seasonings stimulate the appetite. - Latin Proverb', 'Cooking is an observation-based process that you can’t do if you’re so completely focused on a recipe. - Alton Brown', 'Happiness is finding three olives in your martini when you’re hungry. - Johnny Carson', 'A good meal makes a man feel more charitable toward the world than any sermon. - Arthur Pendenys', 'Wine and cheese are ageless companions, like aspirin and aches, or June and moon, or good people and noble ventures. - M. F. K. Fisher', 'For hunger is a sauce, well blended and prepared, for any food. - Chrétien de Troyes', "Without my morning coffee I'm just like a dried up piece of roast goat. - Johann Sebastian Bach", 'You know how I feel about tacos. It’s the only food shaped like a smile. A beef smile. - Earl Hickey']
print(test)
sh3rtext=[text.replace("RT #tiktoksaudi2:","") for text in test if "RT #tiktoksaudi2:" in text]
print(sh3rtext)
The if clause in list comprehensions filters out any items that don't match the condition; ie. in your example, any items that don't contain "RT #tiktoksaudi2:" (-> all of them). Just leave out the if "RT #tiktoksaudi2:" in text and do the replace call on all elements (this will do nothing if an element doesn't contain your string and just return the original) to get the entire list back.
What is the fastest way to remove items in the list that matches substrings in the set?
For example,
the_list =
['Donald John Trump (born June 14, 1946) is an American businessman, television personality',
'and since June 2015, a candidate for the Republican nomination for President of the United States in the 2016 election.',
'He is the chairman and president of The Trump Organization and the founder of Trump Entertainment Resorts.',
'Trumps career',
'branding efforts',
'personal life',
'and outspoken manner have made him a celebrity.',
'Trump is a native of New York City and a son of Fred Trump, who inspired him to enter real estate development.',
'While still attending college he worked for his fathers firm',
'Elizabeth Trump & Son. Upon graduating in 1968 he joined the company',
'and in 1971 was given control, renaming the company The Trump Organization.',
'Since then he has built hotels',
'casinos',
'golf courses',
'and other properties',
'many of which bear his name. He is a major figure in the American business scene and has received prominent media exposure']
The list is actually a lot longer than this (millions of string elements) and I'd like to remove whatever elements that contain the strings in the set, for example,
{"Donald Trump", "Trump Organization","Donald J. Trump", "D.J. Trump", "dump", "dd"}
What will be the fastest way? Is Looping through the fastest?
The Aho-Corasick algorithm was specifically designed for exactly this task. It has the distinct advantage of having a much lower time complexity O(n+m) than nested loops O(n*m) where n is the number of strings to find and m is the number of strings to be searched.
There is a good Python implementation of Aho-Corasick with accompanying explanation. There are also a couple of implementations at the Python Package Index but I've not looked at them.
Use a list comprehension if you have your strings already in memory:
new = [line for line in the_list if not any(item in line for item in set_of_words)]
If you don't have them in memory as a more optimized approach in term of memory use you can use a generator expression:
new = (line for line in the_list if not any(item in line for item in set_of_words))
I'm trying to parse a string containing a name and a degree. I have a long list of these. Some contain no degrees, some contain one, and some contain multiple.
Example strings:
Sam da Man J.D.
Green Eggs Jr. Ed.M.
Argle Bargle Sr. MA
Cersei Lannister M.A. Ph.D.
As far as I can tell, the degrees come in the following patterns:
x.x.
x.x.x.
x.x.xx.
x.xx.
xx.x.
x.xxx.
two caps (ex: 'MA')
How would I parse this?
I'm new to regex and breaking down this problem has proved very time-consuming. I've been using this post and tried split = re.split('\s+|([.])',s) and split = re.split('\s+|\.',s) but these still split on the first space.
I have thought, in response to the first comment, about the degree designations. I've been trying to make a regex that recognizes 'x.x' and then a wildcard afterwards because there are several patterns within the degrees which look like this: x.x(something):
x.x.
x.x.x.
x.x.xx.
and then I'd have a few more to classify.
Alternatively, classifying the name might be easier?
Or even listing the degrees in a collection and searching for them?
{'M.A.T.','Ph.D.','MA','J.D.','Ed.M.', 'M.A.', 'M.B.A.', 'Ed.S.', 'M.Div.', 'M.Ed.", 'RN', 'B.S.Ed.'}
Try to change your "Jr.", "Sr.", ... replacing them with something like this: "Jr~", "Sr~", ...
This is the the regular expression for doing that:
/ (Jr|Sr)\. / $1~ /g
(See here )
You obtain this string:
Sam da Man J.D.
Green Eggs Jr~ Ed.M.
Argle Bargle Sr~ MA
Cersei Lannister M.A. Ph.D.
Now you can easily capture degrees with this regular expression:
/ (MA|RN|([A-Z][a-z]?[a-z]?\.)+) /g
(See here )
you can use this:
'[ ](MA|RN|([A-Z][a-z]?[a-z]?\.){2,3})'
it doesn't take any word with one dot
I think the best approach is either creating a list or regex of specific degrees you're looking for, instead of trying to define patterns like x.x. that will match several different degrees. A pattern like this is too general, and may match many other values in free text (in this case, people's initials).
import re
s = """Sam da Man J.D.
Green Eggs Jr. Ed.M.
Argle Bargle Sr. MA
Cersei Lannister M.A. Ph.D.
Albus Dumbledore M.A.T.
"""
pattern = r"M.A.T.|Ph.D.|MA|J.D.|Ed.M.|M.A.|M.B.A.|Ed.S.|M.Div.|M.Ed.|RN|B.S.Ed."
degrees = re.findall(pattern, s, re.MULTILINE)
print(degrees)
Output:
['J.D.', 'Ed.M.', 'MA', 'M.A.', 'Ph.D.', 'M.A.T.']
If you're looking to get the names that appear between the degrees in a block of text like the one above, you can use re.split.
names = re.split(pattern, s)
names = [n.strip() for n in names if n.strip()]
print(names)
Output:
['Sam da Man', 'Green Eggs Jr.', 'Argle Bargle Sr.', 'Cersei Lannister', 'Albus Dumbledore']
Note that I had to strip the remaining strings and remove empty strings from the results to capture just the names. Doing that operation on the result allows the regex to be much simpler.
Note also that this can still fail when a specific degree could also be someone's initials, (e.g., J.D. Salinger). You may need to make adjustments or other allowances based on your real data.
I'm working with a large database of businesses.
I'd like to be able to compare two business names for similarity to see if they possibly might be duplicates.
Below is a list of business names that should test as having a high probability of being duplicates, what is a good way to go about this?
George Washington Middle Schl
George Washington School
Santa Fe East Inc
Santa Fe East
Chop't Creative Salad Co
Chop't Creative Salad Company
Manny and Olga's Pizza
Manny's & Olga's Pizza
Ray's Hell Burger Too
Ray's Hell Burgers
El Sol
El Sol de America
Olney Theatre Center for the Arts
Olney Theatre
21 M Lounge
21M Lounge
Holiday Inn Hotel Washington
Holiday Inn Washington-Georgetown
Residence Inn Washington,DC/Dupont Circle
Residence Inn Marriott Dupont Circle
Jimmy John's Gourmet Sandwiches
Jimmy John's
Omni Shoreham Hotel at Washington D.C.
Omni Shoreham Hotel
I've recently done a similar task, although I was matching new data to existing names in a database, rather than looking for duplicates within one set. Name matching is actually a well-studied task, with a number of factors beyond what you'd consider for matching generic strings.
First, I'd recommend taking a look at a paper, How to play the “Names Game”: Patent retrieval comparing different heuristics by Raffo and Lhuillery. The published version is here, and a PDF is freely available here. The authors provide a nice summary, comparing a number of different matching strategies. They consider three stages, which they call parsing, matching, and filtering.
Parsing consists of applying various cleaning techniques. Some examples:
Standardizing lettercase (e.g., all lowercase)
Standardizing punctuation (e.g., commas must be followed by spaces)
Standardizing whitespace (e.g., converting all runs of whitespace to single spaces)
Standardizing accented and special characters (e.g., converting accented letters to ASCII equivalents)
Standardizing legal control terms (e.g., converting "Co." to "Company")
In my case, I folded all letters to lowercase, replaced all punctuation with whitespace, replaced accented characters by unaccented counterparts, removed all other special characters, and removed legal control terms from the beginning and ends of the names following a list.
Matching is the comparison of the parsed names. This could be simple string matching, edit distance, Soundex or Metaphone, comparison of the sets of words making up the names, or comparison of sets of letters or n-grams (letter sequences of length n). The n-gram approach is actually quite nice for names, as it ignores word order, helping a lot with things like "department of examples" vs. "examples department". In fact, comparing bigrams (2-grams, character pairs) using something simple like the Jaccard index is very effective. In contrast to several other suggestions, Levenshtein distance is one of the poorer approaches when it comes to name matching.
In my case, I did the matching in two steps, first with comparing the parsed names for equality and then using the Jaccard index for the sets of bigrams on the remaining. Rather than actually calculating all the Jaccard index values for all pairs of names, I first put a bound on the maximum possible value for the Jaccard index for two sets of given size, and only computed the Jaccard index if that upper bound was high enough to potentially be useful. Most of the name pairs were still dissimilar enough that they weren't matches, but it dramatically reduced the number of comparisons made.
Filtering is the use of auxiliary data to reject false positives from the parsing and matching stages. A simple version would be to see if matching names correspond to businesses in different cities, and thus different businesses. That example could be applied before matching, as a kind of pre-filtering. More complicated or time-consuming checks might be applied afterwards.
I didn't do much filtering. I checked the countries for the firms to see if they were the same, and that was it. There weren't really that many possibilities in the data, some time constraints ruled out any extensive search for additional data to augment the filtering, and there was a manual checking planned, anyway.
I'd like to add some examples to the excellent accepted answer. Tested in Python 2.7.
Parsing
Let's use this odd name as an example.
name = "THE | big,- Pharma: LLC" # example of a company name
We can start with removing legal control terms (here LLC). To do that, there is an awesome cleanco Python library, which does exactly that:
from cleanco import cleanco
name = cleanco(name).clean_name() # 'THE | big,- Pharma'
Remove all punctuation:
name = name.translate(None, string.punctuation) # 'THE big Pharma'
(for unicode strings, the following code works instead (source, regex):
import regex
name = regex.sub(ur"[[:punct:]]+", "", name) # u'THE big Pharma'
Split the name into tokens using NLTK:
import nltk
tokens = nltk.word_tokenize(name) # ['THE', 'big', 'Pharma']
Lowercase all tokens:
tokens = [t.lower() for t in tokens] # ['the', 'big', 'pharma']
Remove stop words. Note that it might cause problems with companies like On Mars will be incorrectly matched to Mars, because On is a stopword.
from nltk.corpus import stopwords
tokens = [t for t in tokens if t not in stopwords.words('english')] # ['big', 'pharma']
I don't cover accented and special characters here (improvements welcome).
Matching
Now, when we have mapped all company names to tokens, we want to find the matching pairs. Arguably, Jaccard (or Jaro-Winkler) similarity is better than Levenstein for this task, but is still not good enough. The reason is that it does not take into account the importance of words in the name (like TF-IDF does). So common words like "Company" influence the score just as much as words that might uniquely identify company name.
To improve on that, you can use a name similarity trick suggested in this awesome series of posts (not mine). Here is a code example from it:
# token2frequency is just a word counter of all words in all names
# in the dataset
def sequence_uniqueness(seq, token2frequency):
return sum(1/token2frequency(t)**0.5 for t in seq)
def name_similarity(a, b, token2frequency):
a_tokens = set(a.split())
b_tokens = set(b.split())
a_uniq = sequence_uniqueness(a_tokens)
b_uniq = sequence_uniqueness(b_tokens)
return sequence_uniqueness(a.intersection(b))/(a_uniq * b_uniq) ** 0.5
Using that, you can match names with similarity exceeding certain threshold. As a more complex approach, you can also take several scores (say, this uniqueness score, Jaccard and Jaro-Winkler) and train a binary classification model using some labeled data, which will, given a number of scores, output if the candidate pair is a match or not. More on this can be found in the same blog post.
You could use the Levenshtein distance, which could be used to measure the difference between two sequences (basically an edit distance).
Levenshtein Distance in Python
def levenshtein_distance(a,b):
n, m = len(a), len(b)
if n > m:
# Make sure n <= m, to use O(min(n,m)) space
a,b = b,a
n,m = m,n
current = range(n+1)
for i in range(1,m+1):
previous, current = current, [i]+[0]*n
for j in range(1,n+1):
add, delete = previous[j]+1, current[j-1]+1
change = previous[j-1]
if a[j-1] != b[i-1]:
change = change + 1
current[j] = min(add, delete, change)
return current[n]
if __name__=="__main__":
from sys import argv
print levenshtein_distance(argv[1],argv[2])
There is great library for searching for similar/fuzzy strings for python: fuzzywuzzy. It's a nice wrapper library upon mentioned Levenshtein distance measuring.
Here how your names could be analysed:
#!/usr/bin/env python
from fuzzywuzzy import fuzz
names = [
("George Washington Middle Schl",
"George Washington School"),
("Santa Fe East Inc",
"Santa Fe East"),
("Chop't Creative Salad Co",
"Chop't Creative Salad Company"),
("Manny and Olga's Pizza",
"Manny's & Olga's Pizza"),
("Ray's Hell Burger Too",
"Ray's Hell Burgers"),
("El Sol",
"El Sol de America"),
("Olney Theatre Center for the Arts",
"Olney Theatre"),
("21 M Lounge",
"21M Lounge"),
("Holiday Inn Hotel Washington",
"Holiday Inn Washington-Georgetown"),
("Residence Inn Washington,DC/Dupont Circle",
"Residence Inn Marriott Dupont Circle"),
("Jimmy John's Gourmet Sandwiches",
"Jimmy John's"),
("Omni Shoreham Hotel at Washington D.C.",
"Omni Shoreham Hotel"),
]
if __name__ == '__main__':
for pair in names:
print "{:>3} :: {}".format(fuzz.partial_ratio(*pair), pair)
>>> 79 :: ('George Washington Middle Schl', 'George Washington School')
>>> 100 :: ('Santa Fe East Inc', 'Santa Fe East')
>>> 100 :: ("Chop't Creative Salad Co", "Chop't Creative Salad Company")
>>> 86 :: ("Manny and Olga's Pizza", "Manny's & Olga's Pizza")
>>> 94 :: ("Ray's Hell Burger Too", "Ray's Hell Burgers")
>>> 100 :: ('El Sol', 'El Sol de America')
>>> 100 :: ('Olney Theatre Center for the Arts', 'Olney Theatre')
>>> 90 :: ('21 M Lounge', '21M Lounge')
>>> 79 :: ('Holiday Inn Hotel Washington', 'Holiday Inn Washington-Georgetown')
>>> 69 :: ('Residence Inn Washington,DC/Dupont Circle', 'Residence Inn Marriott Dupont Circle')
>>> 100 :: ("Jimmy John's Gourmet Sandwiches", "Jimmy John's")
>>> 100 :: ('Omni Shoreham Hotel at Washington D.C.', 'Omni Shoreham Hotel')
Another way of solving such kind of problems could be Elasticsearch, which also supports fuzzy searches.
I searched for "python edit distance" and this library came as the first result: http://www.mindrot.org/projects/py-editdist/
Another Python library that does the same job is here: http://pypi.python.org/pypi/python-Levenshtein/
An edit distance represents the amount of work you need to carry out to convert one string to another by following only simple -- usually, character-based -- edit operations. Every operation (substition, deletion, insertion; sometimes transpose) has an associated cost and the minimum edit distance between two strings is a measure of how dissimilar the two are.
In your particular case you may want to order the strings so that you find the distance to go from the longer to the shorter and penalize character deletions less (because I see that in many cases one of the strings is almost a substring of the other). So deletion shouldn't be penalized a lot.
You could also make use of this sample code: http://norvig.com/spell-correct.html
This a bit of an update to Dennis comment. That answer was really helpful as was the links he posted but I couldn't get them to work right off. After trying the Fuzzy Wuzzy search I found this gave me a bunch better set of answers. I have a large list of merchants and I just want to group them together. Eventually I'll have a table I can use to try some machine learning to play around with but for now this takes a lot of the effort out of it.
I only had to update his code a little bit and add a function to create the tokens2frequency dictionary. The original article didn't have that either and then the functions didn't reference it correctly.
import pandas as pd
from collections import Counter
from cleanco import cleanco
import regex
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
# token2frequency is just a Counter of all words in all names
# in the dataset
def sequence_uniqueness(seq, token2frequency):
return sum(1/token2frequency[t]**0.5 for t in seq)
def name_similarity(a, b, token2frequency):
a_tokens = set(a)
b_tokens = set(b)
a_uniq = sequence_uniqueness(a, token2frequency)
b_uniq = sequence_uniqueness(b, token2frequency)
if a_uniq==0 or b_uniq == 0:
return 0
else:
return sequence_uniqueness(a_tokens.intersection(b_tokens), token2frequency)/(a_uniq * b_uniq) ** 0.5
def parse_name(name):
name = cleanco(name).clean_name()
#name = name.translate(None, string.punctuation)
name = regex.sub(r"[[:punct:]]+", "", name)
tokens = nltk.word_tokenize(name)
tokens = [t.lower() for t in tokens]
tokens = [t for t in tokens if t not in stopwords.words('english')]
return tokens
def build_token2frequency(names):
alltokens = []
for tokens in names.values():
alltokens += tokens
return Counter(alltokens)
with open('marchants.json') as merchantfile:
merchants = pd.read_json(merchantfile)
merchants = merchants.unique()
parsed_names = {merchant:parse_name(merchant) for merchant in merchants}
token2frequency = build_token2frequency(parsed_names)
grouping = {}
for merchant, tokens in parsed_names.items():
grouping[merchant] = {merchant2: name_similarity(tokens, tokens2, token2frequency) for merchant2, tokens2 in parsed_names.items()}
filtered_matches = {}
for merchant in pcard_merchants:
filtered_matches[merchant] = {merchant1: ratio for merchant1, ratio in grouping[merchant].items() if ratio >0.3 }
This will give you a final filtered list of names and the other names they match up to. It's the same basic code as the other post just with a couple of missing pieces filled in. This also is run in Python 3.8
Consider using the Diff-Match-Patch library. You'd be interested in the Diff process - applying a diff on your text can give you a good idea of the differences, along with a programmatic representation of them.
What you can do is separate the words by whitespaces, commas, etc. and then you you count the number of words it have in common with another name and you add a number of words thresold before it is considered "similar".
The other way is to do the same thing, but take the words and splice them for each caracters. Then for each words you need to compare if letters are found in the same order (from both sides) for an x amount of caracters (or percentage) then you can say that the word is similar too.
Ex: You have sqre and square
Then you check by caracters and find that sqre are all in square and in the same order, then it's a similar word.
The algorithms that are based on the Levenshtein distance are good (not perfect) but their main disadvantage is that they are very slow for each comparison and concerning the fact that you would have to compare every possible combination.
Another way of working out the problem would be, to use embedding or bag of words to transform each company name (after some cleaning and prepossessing ) into a vector of numbers. And after that you apply an unsupervised or supervised ML method depending on what is available.
I created matchkraft (https://github.com/MatchKraft/matchkraft-python). It works on top of fuzzy-wuzzy and you can fuzzy match company names in one list.
It is very easy to use. Here is an example in python:
from matchkraft import MatchKraft
mk = MatchKraft('<YOUR API TOKEN HERE>')
job_id = mk.highlight_duplicates(name='Stackoverflow Job',
primary_list=[
'George Washington Middle Schl',
'George Washington School',
'Santa Fe East Inc',
'Santa Fe East',
'Rays Hell Burger Too',
'El Sol de America',
'microsoft',
'Olney Theatre',
'El Sol'
]
)
print (job_id)
mk.execute_job(job_id=job_id)
job = mk.get_job_information(job_id=job_id)
print (job.status)
while (job.status!='Completed'):
print (job.status)
time.sleep(10)
job = mk.get_job_information(job_id=job_id)
results = mk.get_results_information(job_id=job_id)
if isinstance(results, list):
for r in results:
print(r.master_record + ' --> ' + r.match_record)
else:
print("No Results Found")
What is the best way to organize scraped data into a csv? More specifically each item is in this form
url
"firstName middleInitial, lastName - level - word1 word2 word3, & wordN practice officeCity."
JD, schoolName, date
Example:
http://www.examplefirm.com/jang
"Joe E. Ang - partner - privatization mergers, media & technology practice New York."
JD, University of Chicago Law School, 1985
I want to put this item in this form:
(http://www.examplefirm.com/jang, Joe, E., Ang, partner, privatization mergers, media & technology, New York, University of Chicago Law School, 1985)
so that I can write it into a csv file to import to a django db.
What would be the best way of doing this?
Thank you.
There's really no short cut on this. Line 1 is easy. Just assign it to url. Line 3 can probably be split on , without any ill effects, but line 2 will have to be manually parsed. What do you know about word1-wordN? Are you sure "practice" will never be a "word". Are you sure the words are only one word long? Can they be quoted? Can they contain dashes?
Then I would parse out the beginning and end bits, so you're left with a list of words, split it by commas and/or & (is there a consistent comma before &? Your format says yes, but your example says no.) If there are a variable number of words, you don't want to inline them in your tuple like that, because you don't know how to get them out. Create a list from your words, and add that as one element of the tuple.
>>> tup = (url, first, middle, last, rank, words, city, school, year)
>>> tup
('http://www.examplefirm.com/jang', 'Joe', 'E.', 'Ang', 'partner',
['privatization mergers', 'media & technology'], 'New York',
'University of Chicago Law School', '1985')
More specifically? You're on your own there.